tidycensus::get_acs(geography = "county", variables = "B01003_001") will get you the latest 2016-2020 ACS estimates
https://walker-data.com/census-r/wrangling-census-data-with-tidyverse-tools.html
Syllabus
Course Title: GIS5122: Applied Spatial Statistics
Contact information
- Instructor Name: Professor James B. Elsner
- Instructor Location: Bellamy Building, Room 323a
- Lesson Hours: TR 8:00-9:15 a.m.
- Student Hours: TR 9:15-10:30 a.m., 2-3 p.m.
Email: jelsner@fsu.edu
Links to my professional stuff (if you are curious)
Course description and expected learning outcomes
This course is for students who want to learn how to analyze, map, and model spatial and geographical data using the R programming language. It assumes that students know basic statistics through multiple linear regression and that they have some prior experience using R. Students without any knowledge of R should work through various online tutorials (see below).
In this course you will get a survey of the methods used to describe, analyze, and model spatial data. Focus will be on applications. Emphasis is given to how spatial statistical methods are related through the concept of spatial autocorrelation.
Expected learning outcomes
- Learn how and when to apply statistical methods and models to spatial data,
- learn various packages in R for analyzing and modeling spatial data, and
- learn how to interpret the results of a spatial data model.
The course offers a programming approach to exposing you to spatial statistics. I want to demystify the process and give you confidence that you can analyze and fit spatial models. I believe some investment in honing programming skills will pay dividends for you later on.
But in taking this approach I don’t want to give you the false impression that statisticians have the answers. A working knowledge of the model fitting process needs to be combined with a good understanding of the context in which you are working.
Materials and class meetings
Access to the internet and a computer
Lesson and assignment files on GitHub
No textbook is required
Many excellent online resources are available. Here are some of my favorites
Class meetings
During each lesson I will work through and explain the R code and notes contained within an xx-Lesson.Rmd file. The notes in the lesson files are comprehensive, so you can work through them on your own if you are unable to make it to class.
Notes are written using the markdown language. Markdown is a way to write content for the Web. An R markdown file has the suffix .Rmd (R markdown file). The file is opened using the RStudio application.
Grades and ethics
You are responsible for:
- Reading and running the code in the lesson R markdown (.Rmd) files. You can do this during the remote lessons as I talk and run my code, or outside of class on your own
- Completing and returning the lab assignments on time
Grades are determined by how well you do on the assignments using the following standard:
- A: Outstanding: few, if any, errors/omissions
- B: Good: only minor errors/omissions
- C: Satisfactory: minor omissions, at least one major error/omission
- D: Poor: several major errors/omissions
- F: Fail: many major errors/omissions
I’ll use the +/- grading system.
Grades will be posted as they are recorded on FSU Canvas
Academic honor code
https://fda.fsu.edu/academic-resources/academic-integrity-and-grievances/academic-honor-policy
Americans With Disabilities Act
Students with disabilities needing academic accommodation should: (1) register with and provide documentation to the Student Disability Resource Center; (2) bring a letter indicating the need for accommodation and what type. This should be done during the first week of classes.
Diversity and inclusiveness
It is my intent to present notes and data that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture.
Outline of topics and schedule
- Working with data and making graphs (~ 4 lessons)
- Working with spatial data and making maps (~ 5 lessons)
- Quantifying spatial autocorrelation and spatial regression (~ 5 lessons)
- Analyzing and modeling point pattern data (~ 6 lessons)
- Estimating variograms and interpolating spatially (~ 6 lessons)
- Other topics (~ 2 lessons)
| Week | Dates | Topic |
|---|---|---|
| 1 | August 23, 25 | Syllabus and setup |
| 2 | August 30, September 1 | Data frames |
| 3 | September 6, 8 | |
| 4 | September 13, 15 | |
| 5 | September 20, 22 | |
| 6 | September 27, 29 | |
| 7 | October 4, 6 | |
| 8 | October 11, 13 | |
| 9 | October 18, 20 | |
| 10 | October 25, 27 | |
| 11 | November 1, 3 | |
| 12 | November 8, 10 | |
| 13 | November 15, 17 | |
| 14 | November 29, December 1 | |

28 class dates: 23 lesson days + 5 lab days
| Lab | Date | Lessons |
|---|---|---|
| 1 | Tuesday September 6 | |
| 2 | Thursday September 22 | |
| 3 | Thursday October 13 | |
| 4 | Tuesday November 8 | |
| 5 | Thursday December 1 | |
Reference materials
- Bivand, R. S., E. J. Pebesma, and V. G. Gomez-Rubio, 2013: Applied Spatial Data Analysis with R, 2nd Edition, Springer. A source for much of the material in the lesson notes.
- Lovelace, R., J. Nowosad, and J. Muenchow: Geocomputation with R. https://geocompr.robinlovelace.net/. A source for some of the material in the lesson notes.
- Healy, K., 2018: Data Visualization: A practical introduction, https://socviz.co/. This book teaches you how to really look at your data. A source for some of the early material in the lesson notes.
- Waller, L. A., and C. A. Gotway, 2004: Applied Spatial Statistics for Public Health Data, John Wiley & Sons, Inc. (Available as an e-book in the FSU library). Good overall reference material for analyzing and modeling spatial data.
- Analyzing US Census Data: Methods, Maps, and Models in R https://walker-data.com/census-r/index.html
- Cheat Sheets: https://rstudio.com/resources/cheatsheets/
- R Cookbook: How to do specific things: https://rc2e.com/
- R for Geospatial Processing: https://bakaniko.github.io/FOSS4G2019_Geoprocessing_with_R_workshop/
- Spatial Data Science: https://keen-swartz-3146c4.netlify.com/
Maps/graphs
- Inset maps: https://geocompr.github.io/post/2019/ggplot2-inset-maps/
- {cartography} package in R: https://riatelab.github.io/cartography/docs/articles/cartography.html
- geovisualization with {mapdeck}: https://spatial.blog.ryerson.ca/2019/11/21/geovis-mapdeck-package-in-r/
- 3D elevation with {rayshader}: https://www.rayshader.com/
- 3D elevation to 3D printer: https://blog.hoxo-m.com/entry/2019/12/19/080000
- Accelerate your plots with {ggforce}: https://rviews.rstudio.com/2019/09/19/intro-to-ggforce/
- Summary statistics and ggplot: https://ggplot2tutor.com/summary_statistics/summary_statistics/
Space-time statistics
- Space-time Bayesian modeling package: https://cran.r-project.org/web/packages/spTimer/spTimer.pdf
- Working with space-time rasters: https://github.com/surfcao/geog5330/blob/master/week12/raster.Rmd
Bayesian models
- Bayesian Linear Mixed Models: Random intercepts, slopes and missing data: https://willhipson.netlify.com/post/bayesian_mlm/bayesian_mlm/
- Doing Bayesian Data Analysis in {brms} and the {tidyverse}: https://bookdown.org/ajkurz/DBDA_recoded/
- Spatial models with INLA: https://becarioprecario.bitbucket.io/inla-gitbook/index.html
- Geospatial Health Data: Modeling and Visualization with {R-INLA} and {shiny}: https://paula-moraga.github.io/book-geospatial/
- Bayesian workflow: https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#1_questioning_authority
Spatial data
- Progress in the R ecosystem for representing and handling spatial data https://link.springer.com/article/10.1007/s10109-020-00336-0
- Google earthengine: https://earthengine.google.com/
- Burden of roof: Revisiting housing costs with {tidycensus}: https://austinwehrwein.com/data-visualization/housing/
- The Care and Feeding of Spatial Data: https://docs.google.com/presentation/d/1BHlrSZWmw9tRWfYFVsRLNhAoX6KzhOhsnezTqL-R0sU/edit#slide=id.g6aeb55b281_0_550
- Accessing remotely sensed imagery: https://twitter.com/mouthofmorrison/status/1212840820019208192/photo/1
- Spatial data sets from Brazil: https://github.com/ipeaGIT/geobr
Machine learning
- Supervised machine learning case studies: https://supervised-ml-course.netlify.com/
- Machine learning for spatial prediction: https://www.youtube.com/watch?v=2pdRk4cj1P0&feature=youtu.be
- Machine learning on spatial data: https://geocompr.robinlovelace.net/spatial-cv.html
Spatial networks
- Spatial Networks in R with {sf} and {tidygraph}: https://www.r-spatial.org/r/2019/09/26/spatial-networks.html
- Travel times/distances: https://github.com/rCarto/osrm
- Making network graphs in R - {ggraph} and {tidygraph} introduction https://youtu.be/geYZ83Aidq4
Transport planning/routing
- https://docs.ropensci.org/stplanr/index.html
- https://www.urbandemographics.org/post/r5r-fast-multimodal-transport-routing-in-r/
Time series forecasting
https://weecology.github.io/MATSS/
Movement
https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/1365-2656.13116
Bookdown
- Introduction: https://bookdown.org/yihui/bookdown/introduction.html
- Learning more: https://ropensci.org/blog/2020/04/07/bookdown-learnings/
Climate data
- https://cran.r-project.org/web/packages/climate/vignettes/getstarted.html
- https://www.ncdc.noaa.gov/teleconnections/enso/indicators/soi/data.csv
- USGS water data: https://waterdata.usgs.gov/blog/dataretrieval/
Reference books
- Anselin, L., 2005: Spatial Regression Analysis in R, Spatial Analysis Laboratory, Center for Spatially Integrated Social Science.
- Baddeley, A., and R. Turner, 2005: spatstat: An R Package for Analyzing Spatial Point Patterns, Journal of Statistical Software, v12.
- Blangiardo, M., and M. Cameletti, 2015: Spatial and Spatio-temporal Bayesian Models with R-INLA, John Wiley & Sons, Inc., New York. An introduction to Bayesian models for spatial data.
- Cressie, N. A. C., 1993: Statistics for Spatial Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York. A mathematical treatment of spatial data analysis.
- Cressie, N. A. C., and C. K. Wikle, 2011: Statistics for Spatio-Temporal Data, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, Inc., New York. A mathematical treatment of space-time statistics with an emphasis on Bayesian models.
- Diggle, P. J., 2003: Statistical Analysis of Spatial Point Patterns, Second Edition, Arnold Publishers. An introduction to the concepts and methods of statistical analysis of spatial point patterns.
- Fotheringham, A. S., C. Brunsdon, and M. Charlton, 2000: Quantitative Geography: Perspectives on Spatial Data Analysis, SAGE Publications, London. A survey of spatial data analysis from the perspective of modern geography.
- Haining, R., 2003: Spatial Data Analysis: Theory and Practice, Cambridge University Press. A confluence of geographic information science and applied spatial statistics.
- Illian, J., A. Penttinen, H. Stoyan, and D. Stoyan, 2008: Statistical Analysis and Modeling of Spatial Point Patterns, Wiley Series in Statistics in Practice, John Wiley & Sons, Inc., New York. A mathematical treatment of spatial point processes.
- Ripley, B. D., 1981: Spatial Statistics, Wiley, New York. A reference book on spatial data analysis with emphasis on point pattern analysis.
- Wickham, H., 2009: ggplot2: Elegant Graphics for Data Analysis, Springer UseR! Series, Springer, New York. An introduction to the ggplot package for graphics.
Recent research examples
Reproducible research
A scientific paper has at least two goals: announce a new result and convince readers that the result is correct. Scientific papers should describe the results and provide a clear protocol to allow repetition and extension.
Analysis and modeling tools should integrate text with code to make it easier to provide a clear protocol of what was done.
- Such tools make doing research efficient. Changes are made with little effort.
- Such tools allow others to build on what you’ve done. Research achieves more faster.
- Collaboration is easier.
- Code sharing leads to greater research impact. Research impact leads to promotion & tenure.
Free and open source software for geospatial data has progressed at an astonishing rate. High performance spatial libraries are now widely available.
However, much of it is still not easy to script. Open source Geographic Information Systems (GIS) like QGIS (see https://qgis.org) have greatly reduced the ‘barrier to entry’ but emphasis on the graphical user interface (GUI) makes reproducible research difficult.
Instead here we will focus on a command line interface (CLI) to help you create reproducible work flows.
You might be interested in this article: Practical reproducibility in geography and geosciences
Tuesday, August 23, 2022
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” — Martin Fowler
Today
- What this course is about
- Details about lessons, assignments, and grading
- How to get the most out of this course

Example questions:

- Is Milwaukee snowier than Madison?
- Is global warming making hurricanes stronger?
- Are tornadoes more likely to form over smooth terrain?
Understand what this course is about, how it is structured, and what I expect from you
Getting set to work with R and RStudio
Install R and RStudio on your computer
First get R
- Go to http://www.r-project.org
- Select CRAN (the Comprehensive R Archive Network) and scroll to a mirror site
- Choose the appropriate file for your computer
- Follow the instructions to install R
Then get RStudio
- Go to http://rstudio.org
- Download RStudio Desktop
- Install and open RStudio
Finally (not required for success in this class), learn git with R
Download course materials
- Navigate to https://github.com/jelsner/ASS-2022
- Click on the bright green Code button
- Download ZIP
- Unzip the file on your computer
- Open the ASS-2022.Rproj file
About RStudio
Written in HTML (like your Web browser)
Top menus
- File > New File > R Markdown
- Tools > Global Options > Appearance
Upper left panel is the markdown file. This is where you put your text and code
- Run code chunks from this panel
- Output from the operations can be placed in this panel or in the Console (see the gear icon above)
- All the text, code, and output can be rendered to an HTML file or a PDF or Word document (see the Knit button above)
Upper right panel shows what is in your current environment and the history of the commands you issued
- This is also where you can connect to github
Lower left panel is the Console
- I think of this as a sandbox where you try out small bits of code. If it works and is relevant to what you want to do you move it to the markdown file
- This is also where output from running code will be placed
- Not a place for plain text
Lower right panel shows your project files, the plots that get made, and all the packages associated with the project
- The File tab shows the files in the project. The most important one is the .Rmd.
- The Plot tab currently shows a blank sheet
- The Packages tab shows all the packages that have been downloaded from CRAN and are associated with this project
Lab assignments
You will do all assignments inside an R markdown (.Rmd) file.

- Get the assignment .Rmd file from GitHub and rename it to yourLastName_yourFirstName.Rmd
- Open the .Rmd file with RStudio
- Replace 'Your Name' with your name in the preamble (YAML)
- Answer the questions by typing appropriate code between the code-chunk delimiters
- Select the Knit button to generate an HTML file
- Fix any errors
- Email your completed assignment .Rmd file to jelsner@fsu.edu
Getting started with R
Applied statistics is the analysis and modeling of data. Use the c() function to input small bits of data into R. The function combines (concatenates) items in a list together.
For example, consider a set of hypothetical annual land falling hurricane counts over a ten-year period.
2 3 0 3 1 0 0 1 2 1
You save these 10 integer values in your current environment by typing them into the console as follows. The console is the lower-left window.

```r
counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
counts
## [1] 2 3 0 3 1 0 0 1 2 1
```

You assign the values to an object called counts. The assignment operator is an arrow (<-), although an equal sign (=) also works. Values do not print when assigned; they are printed by typing the object name, as on the second line. When printed, the values are prefaced with a [1]. This indicates that the object is a vector and the first entry in the vector has a value of 2 (the number immediately to the right of [1]).
Use the arrow keys to retrieve previous commands. Each command is stored in the history file. The up-arrow key moves backwards through the history file. The left and right arrow keys move the cursor along the line.
Then you apply functions to data stored in an object.
```r
sum(counts)
## [1] 13
length(counts)
## [1] 10
sum(counts) / length(counts)
## [1] 1.3
mean(counts)
## [1] 1.3
```
The function sum() totals the number of hurricanes over all ten years, length() gives the number of elements in the vector. There is one element (integer value) for each year, so the function returns a value of 10.
Other functions include sort(), min(), max(), range(), diff(), and cumsum(). Try these functions on the landfall counts. What does the range() function do? What does the function diff() do?
```r
diff(counts)
## [1] 1 -3 3 -2 -1 0 1 1 -1
```
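As a quick check on two of those functions (values assume the counts vector defined above):

```r
counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
range(counts)   # smallest and largest annual count
## [1] 0 3
cumsum(counts)  # running total of landfalls, year by year
## [1] 2 5 5 8 9 9 9 10 12 13
```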
The hurricane count data stored in the object counts is a vector. This means that R keeps track of the order that the data were entered. There is a first element, a second element, and so on. This is good for several reasons.
The vector of counts has a natural order; year 1, year 2, etc. You don’t want to mix these. You would like to be able to make changes to the data item by item instead of entering the values again. Also, vectors are math objects so that math operations can be performed on them.
For example, suppose counts contain the annual landfall count from the first decade of a longer record. You want to keep track of counts over other decades. This is done here as follows.
```r
d1 <- counts
d2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)
```

Most functions operate on each element of the data vector at the same time.
```r
d1 + d2
## [1] 2 8 4 5 4 0 3 4 4 2
```
The first year of the first decade is added to the first year of the second decade, and so on.
What happens if you apply the c() function to these two vectors? Try it.
```r
c(d1, d2)
## [1] 2 3 0 3 1 0 0 1 2 1 0 5 4 2 3 0 3 3 2 1
```
If you are interested in each year’s count as a difference from the decade mean, you type
```r
d1 - mean(d1)
## [1] 0.7 1.7 -1.3 1.7 -0.3 -1.3 -1.3 -0.3 0.7 -0.3
```
In this case a single number (the mean of the first decade) is subtracted from a vector. The result is from subtracting the number from each entry in the data vector.
This is an example of data recycling. R repeats values from one vector so that the vector lengths match. Here the mean is repeated 10 times.
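A minimal illustration of recycling with made-up numbers (not the lesson data):

```r
# The shorter vector c(10, 20) is recycled to length 4 before the addition
c(1, 2, 3, 4) + c(10, 20)
## [1] 11 22 13 24
```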
Are you completely new to R?
The {swirl} package contains functions to get you started with the basics of R. To install the package use the install.packages() function with the name of the package in quotes. The function downloads the package from the Comprehensive R Archive Network (CRAN). You update packages using the update.packages() function.
To make the functions work in your current session use the library() function with the name of the package (without quotes). This needs to be done for every session, but only once per session.
```r
install.packages("swirl")
library(swirl)
```

Type:

```r
swirl()
```

Choose the lesson: R Programming. Work through lessons 1:8.
Getting help: https://www.r-project.org/help.html
Thursday, August 25, 2022
“The trouble with programmers is that you can never tell what a programmer is doing until it’s too late.” — Seymour Cray
Today
- Expectations
- Data science workflow with R markdown
- An introduction to using R
- Data frames
Expectations
Lesson Hours: Mon/Wed 9:05 a.m. - 9:55 a.m., Lab Hours: Fri 9:05 a.m. - 9:55 a.m., Student Hours: Mon/Wed 9:55 a.m. - 10:30 a.m. The best way to contact me is through email: jelsner@fsu.edu.
This course is a survey of methods to describe, analyze, and model spatial data using R. Focus is on applications. I emphasize how spatial statistical methods are related through the concept of spatial autocorrelation.
During each lesson I will work through and explain the R code within an xx-Lesson.Rmd file. The notes in the files are comprehensive, so you can work through them on your own. The notes are written using the markdown language.
Grades are determined by how well you do on the weekly assignments.
There are online sites dedicated to all aspects of the R programming language. A list of some of the ones related to spatial analysis and modeling are in the syllabus.
You should now be set up with R and RStudio. If not I will help you after class. I will spend the first several lessons teaching you how to work with R. For some of you this material might be a review.
On the other hand, if this is entirely new don’t get discouraged. This class does not involve writing complex code.
Today I review how to work with small bits of data using functions from the {base} packages. The {base} packages are included in your installation. They form the scaffolding for working with the code, but much of what you will do in this class involves functions from other packages.
The one exception is that I introduce functions from the {readr} package today that simplify getting data into R. These functions are similar to the corresponding functions in the {base} package.
Data science workflow with R markdown
A scientific paper is an advertisement for a claim about the world. The proof is the procedure that was used to obtain the result that undergirds the claim. The computer code is the exact procedure.
Computer code is the recipe for what was done. It is the most efficient way to communicate precisely the steps involved. Communication to others and to your future self.
When you use a spreadsheet, it’s hard to explain to someone precisely what you did. Click here, then right click here, then choose menu X, etc. The words you use to describe these types of procedures are not standard.
If you’ve ever made a map using GIS you know how hard it is to make another (even similar one) with a new set of data. Running code with new data is simple.
Code is an efficient way to communicate because all important information is given as plain text without ambiguity. Being able to code is a key skill for most technical jobs.
The person most likely to reproduce our work a few months later is us. This is especially true for graphs and figures. These often have a finished quality to them as a result of tweaking and adjustments to the details. This makes it hard to reproduce later. The goal is to do as much of this tweaking as possible with the code we write, rather than in a way that is invisible (retrospectively). Contrast editing an image in Adobe Illustrator.
In data science we toggle between:
Writing code: Code to get our data into R, code to look at tables and summary statistics, code to make graphs, code to compute spatial statistics, code to model and plot our results.
Looking at output: Our code is a set of instructions that produces the output we want: a table, a model, or a figure. It is helpful to be able to see that output.
Taking notes: We also write text about what we are doing, why we are doing it, and what our results mean.
To be efficient we write our code and our comments together in the same file. This is where R markdown comes in (files that end with .Rmd). An R markdown file is a plain text document where text (such as notes or discussion) is interspersed with pieces, or chunks, of R code. When we Knit this file the code is executed (from the top to the bottom of the file) and the output is inserted alongside the code.
The resulting file is converted into an HTML, PDF, or Word document. The text in the markdown file uses simple formatting instructions. For example, surrounding symbols mark *italics*, **bold**, and `code font`. When we create a new markdown document in RStudio, it contains a sample template.
Lesson notes for this class are written in text using markdown formatting as needed. Text is interspersed with code. The format for code chunks is
````
```{r}
# lines of code here
```
````

Three back-ticks (on a U.S. keyboard, the character under the escape key) followed by a pair of curly braces containing the name of the language signal that code is about to begin. We write our code as needed, and then end the chunk with a new line containing three more back-ticks. We can use the Insert button above to save time.
In the markdown file, the lines between the first and second set of back-ticks are grayed, and a few small icons appear in the upper-right corner of the grayed area. The green triangle is used to execute the code and post the results either in the console below or in-line below the chunk.
When we keep our notes in this way, we are able to see everything together, the code, the output it produces, and our commentary or clarification on it. Also we can turn it into a good-looking document with one click. This is how we will do everything in this course.
For example, select the Knit button above.
Finally, note the Outline button in the upper right corner of the markdown file. We can organize and navigate through the markdown file section by section based on the pound symbol (#).
An introduction to using R
Applied spatial statistics is the analysis and modeling of data that was collected across space. To begin you need to know about data objects.
The c() function is used to create a simple data object (vector object). The function combines (concatenates) individual values into a vector. The length of the vector is the number of data values.
Consider a set of annual land falling hurricane counts over a ten-year period. In the first year there were two hurricanes, the next year there were three, and so on.
2 3 0 3 1 0 0 1 2 1
You save these ten values by assigning them to an object that you call counts. The assignment operator is an equal sign (<- or =).
```r
counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1)
```

By clicking on the Environment tab in the upper-right panel you see the object counts with numerical values (num) 2 3, etc. below the word Values. The elements of the vector object are indexed between 1 and 10 (1:10).
You print the values to the console by typing the name of the data object.
```r
counts
## [1] 2 3 0 3 1 0 0 1 2 1
```
When printed the values are prefaced with a [1]. This indicates that the object is a vector and the first element in the vector has a value of 2 (The number immediately to the right of [1]).
Note: You can assign and print by wrapping the entire line of code in parentheses.
```r
(counts <- c(2, 3, 0, 3, 1, 0, 0, 1, 2, 1))
## [1] 2 3 0 3 1 0 0 1 2 1
```
You can use the arrow keys on your keyboard to retrieve previous commands. Each command is stored in the history file (click on the History tab in the upper-right panel). The up-arrow key moves backwards through the history file. The left and right arrow keys move the cursor along the line.
You apply functions to data objects. A function has a name and parentheses. Inside the parentheses are the function arguments. Many functions have only a single argument, the data object.
```r
sum(counts)
## [1] 13
length(counts)
## [1] 10
sum(counts) / length(counts)
## [1] 1.3
mean(counts)
## [1] 1.3
```
The function sum() totals the hurricane counts over all years, length() returns the number of elements in the vector. Other functions include sort(), min(), max(), range(), diff(), and cumsum().
The object counts that you create is a vector in the sense that the elements are ordered. There is a first element, a second element, and so on. This is good for several reasons.
The hurricane counts have a chronological order: year 1, year 2, etc., and you want that order reflected in the data object. Also, you would like to be able to make changes to the data values by element. Finally, vectors are math objects, so math operations can be performed on them in a natural way.
For example, math tells us that a scalar multiplied by a vector is a vector where each element of the product has been multiplied by the scalar. The asterisk * is used for multiplication.
```r
10 * counts
## [1] 20 30 0 30 10 0 0 10 20 10
```
Further, suppose counts contain the annual landfall count from the first decade of a longer record. You want to keep track of counts over other decades.
```r
d1 <- counts
d2 <- c(0, 5, 4, 2, 3, 0, 3, 3, 2, 1)
```

Most functions operate on each element of the data vector all at once.
```r
d1 + d2
## [1] 2 8 4 5 4 0 3 4 4 2
```
The first year of the first decade is added to the first year of the second decade and so on.
What happens if you apply the c() function to these two vectors?
```r
c(d1, d2)
## [1] 2 3 0 3 1 0 0 1 2 1 0 5 4 2 3 0 3 3 2 1
```
You get a vector with elements from both d1 and d2 in the order of first the first decade counts and then the second decade counts.
If you are interested in each year’s count as a difference from the average number over the decade you type
```r
d1 - mean(d1)
## [1] 0.7 1.7 -1.3 1.7 -0.3 -1.3 -1.3 -0.3 0.7 -0.3
```
In this case a single number (the average of the first decade) is subtracted from each element of the vector.
Suppose you are interested in the interannual variability in the set of landfall counts. The variance is computed as \[ \hbox{var}(x) = \frac{(x_1 - \bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2 \]
Although the var() function computes this, here you see how to do this using simple functions. The key is to find the squared differences and then sum.
```r
x <- d1
xbar <- mean(x)
x - xbar
## [1] 0.7 1.7 -1.3 1.7 -0.3 -1.3 -1.3 -0.3 0.7 -0.3
(x - xbar)^2
## [1] 0.49 2.89 1.69 2.89 0.09 1.69 1.69 0.09 0.49 0.09
sum((x - xbar)^2)
## [1] 12.1
n <- length(x)
n
## [1] 10
sum((x - xbar)^2) / (n - 1)
## [1] 1.344444
var(x)
## [1] 1.344444
```
Elements in a vector object must all have the same type. This type can be numeric, as in counts, or character strings, as in

```r
simpsons <- c('Homer', 'Marge', 'Bart', 'Lisa', 'Maggie')
simpsons
## [1] "Homer" "Marge" "Bart" "Lisa" "Maggie"
```
Character strings are made with matching quotes, either double, ", or single, '. If you mix types the values will be coerced into a common type, which is usually a character string. Arithmetic operations do not work on character strings.
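A quick illustration of this coercion (not from the lesson data): mixing a number and a logical with a character string turns everything into character strings.

```r
c(1, "two", TRUE)
## [1] "1"    "two"  "TRUE"
```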
Returning to the landfalling hurricane counts: suppose the National Hurricane Center (NHC) reanalyzes a storm and determines that the 6th year of the 2nd decade had 1 landfall rather than 0. In this case you change the sixth element to have the value 1.
```r
d2[6] <- 1
```

You assign to the 6th year of the decade a value of one. The square brackets [] are used to reference elements of the data vector.
It is important to keep this straight: Parentheses () are used by functions and square brackets [] are used by data objects.
```r
d2
## [1] 0 5 4 2 3 1 3 3 2 1
d2[2]
## [1] 5
d2[-4]
## [1] 0 5 4 3 1 3 3 2 1
d2[c(1, 3, 5, 7, 9)]
## [1] 0 4 3 3 2
```

The first line prints all the elements of the vector d2. The second prints only the 2nd value of the vector. The third prints all but the 4th value. The fourth prints the values with odd element numbers.
To create structured data, for example the integers 1 through 99 you can use the : operator.
```r
1:99
rev(1:99)
99:1
```

The seq() function is more general. You specify the sequence interval with the by = or length = arguments.
```r
seq(from = 1, to = 9, by = 2)
## [1] 1 3 5 7 9
seq(from = 1, to = 10, by = 2)
## [1] 1 3 5 7 9
seq(from = 1, to = 9, length = 5)
## [1] 1 3 5 7 9
```
The rep() function is used to create repetitive sequences. The first argument is a value or vector that we want repeated and the second argument is the number of times you want it repeated.
rep(1, times = 10)## [1] 1 1 1 1 1 1 1 1 1 1
rep(simpsons, times = 2)## [1] "Homer" "Marge" "Bart" "Lisa" "Maggie" "Homer" "Marge" "Bart"
## [9] "Lisa" "Maggie"
In the second example the vector simpsons containing the Simpson characters is repeated twice.
To repeat each element of the vector use the each = argument.
rep(simpsons, each = 2)## [1] "Homer" "Homer" "Marge" "Marge" "Bart" "Bart" "Lisa" "Lisa"
## [9] "Maggie" "Maggie"
More complicated patterns can be repeated by specifying pairs of equal length vectors. In this case, each element of the first vector is repeated the corresponding number of times specified by the element in the second vector.
rep(c("long", "short"), times = c(2, 3))## [1] "long" "long" "short" "short" "short"
To find the maximum number of landfalls during the first decade you type
max(d1)## [1] 3
What years had the maximum?
d1 == 3## [1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Notice the double equals signs (==). This is a logical operator that tests each value in d1 to see if it is equal to 3. The 2nd and 4th values are equal to 3 so TRUEs are returned.
Think of this as asking R a question. Is the value equal to 3? R answers all at once with a vector of TRUE’s and FALSE’s.
What years had fewer than 2 hurricanes?
d1 < 2## [1] FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
Now the question is how do you get the vector element corresponding to the TRUE values? That is, which years have 3 landfalls?
which(d1 == 3)## [1] 2 4
The function which.max() can be used to get the first maximum.
which.max(d1)## [1] 2
You might also want to know the total number of landfalls in each decade and the number of years in a decade without a landfall. Or how about the ratio of the mean number of landfalls over the two decades?
sum(d1)## [1] 13
sum(d2)## [1] 24
sum(d1 == 0)## [1] 3
sum(d2 == 0)## [1] 1
mean(d2)/mean(d1)## [1] 1.846154
So there are 85% more landfalls during the second decade. Is this difference statistically significant?
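One way to get a rough answer, assuming the annual counts are Poisson distributed, is a two-sample rate comparison with the poisson.test() function (an aside here; formal inference comes later in the course):

```r
# Compare 24 landfalls in 10 years against 13 landfalls in 10 years,
# assuming Poisson counts (the decade sums computed above)
poisson.test(x = c(24, 13), T = c(10, 10))
```

The reported p-value is around .1, so a difference this large between decades could plausibly arise by chance.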
To remove an object from the environment use the rm() function.
rm(d1, d2)
Data frames
Spatial data frames will be used throughout this course. A spatial data frame is a data frame plus information about the spatial geometry. Let’s start with data frames.
A data frame stores data in a tabular format like a spreadsheet. It is a list of vectors each with the same length. It has column names (and sometimes row names).
For example, you create a data frame object df containing three vectors n, s, b each with three elements using the data.frame() function.
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b)
To see that the object is indeed a data frame you use the class() function with the name of the object inside the parentheses.
class(df)## [1] "data.frame"
The object df is of class data.frame. Note that the object name shows up in our Environment under Data and it includes a little blue arrow indicating that you can view it by clicking on the row.
The data frame shows up as a table (like a spreadsheet) in the View() mode (see the command in the console below). Caution: This is not advised for large data frames.
The top line of the table is called the header. Each line below the header contains a row of data, which begins with the name (or number) of the row followed by the data values.
Each data element is in a cell. To retrieve a data value from a cell, you enter its row and column coordinates in that order in the single square bracket [] operator, separated by a comma.
Here is the cell value from the first row, second column of df.
df[1, 2]## [1] "aa"
You can print the column names (located in the top row in the View() mode) with the names() function.
names(df)## [1] "n" "s" "b"
The list of names is a vector of length three containing the elements “n”, “s”, and “b” in that order.
You access individual columns of a data frame as vectors by appending the dollar sign ($) to the object name. For example, to print the values of the column labeled s type
df$s## [1] "aa" "bb" "cc"
Many of the packages we will use this semester include example data frames. The data frame called mtcars, for instance, contains information extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
class(mtcars)## [1] "data.frame"
names(mtcars)## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
The number of data rows and data columns in the data frame are printed using the nrow() and ncol() functions.
nrow(mtcars)## [1] 32
ncol(mtcars)## [1] 11
Further details of built-in data frames like mtcars are available in the documentation accessed with the help() (or ?) function.
help(mtcars)
If you type the name of the data frame in the console all the data are printed.
mtcars## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Instead, to get a glimpse of the data, use the head() function, which prints the first six rows, or str(), which lists all the columns by data type.
head(mtcars)## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars)## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Tuesday August 30, 2022
“When I’m explaining some of the tidyverse principles and philosophy in R statistics, I often break down a home-baked chunk of code and illustrate that ‘it says what it does and it does what it says.’” — Diane Beldame
Today
- Working with data frames
Working with data frames
Consider the data frame studentdata from the {LearnBayes} package. To access this data frame, you first install the package with the install.packages() function. You put the name of the package {LearnBayes} in quotes (single or double). Then to make the functions from the package available to your current session use the library() function with the name of the package (unquoted) inside the parentheses.
if(!require(LearnBayes)) install.packages(pkgs = "LearnBayes", repos = "http://cran.us.r-project.org")## Loading required package: LearnBayes
library(LearnBayes)
Note: The argument repos = in the install.packages() function directs where the package can be obtained on CRAN (the Comprehensive R Archive Network). The CRAN repository is set automatically when using RStudio and you can install packages by clicking on Packages > Install in the lower-right panel.
For interactive use you need to specify the repository. When you use the Knit button you don’t want to reinstall packages that already exist on your computer, so you add the conditional if() function that says “only install the package IF it is not (!) already available”.
Make a copy of the data frame by assigning it to an object with the name df and print the first six rows using the head() function.
df <- studentdata
head(df)## Student Height Gender Shoes Number Dvds ToSleep WakeUp Haircut Job Drink
## 1 1 67 female 10 5 10 -2.5 5.5 60 30.0 water
## 2 2 64 female 20 7 5 1.5 8.0 0 20.0 pop
## 3 3 61 female 12 2 6 -1.5 7.5 48 0.0 milk
## 4 4 61 female 3 6 40 2.0 8.5 10 0.0 water
## 5 5 70 male 4 5 6 0.0 9.0 15 17.5 pop
## 6 6 63 female NA 3 5 1.0 8.5 25 0.0 water
Data frames are like spreadsheets with rows and columns. The rows are the observations (here each row is a student in an intro stats class at Bowling Green State University) and the columns are the variables. Here the variables are answers to questions like what is your height, choose a number between 1 and 10, what time did you go to bed last night, etc.
The names of the columns are printed using the names() function.
names(df)## [1] "Student" "Height" "Gender" "Shoes" "Number" "Dvds" "ToSleep"
## [8] "WakeUp" "Haircut" "Job" "Drink"
All columns are of the same length, but not all students answered all questions, so some of the data frame cells contain the missing-value indicator NA.
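You can count the missing values with the is.na() function, which returns TRUE wherever a value is NA. A sketch, assuming the df copy of studentdata from above:

```r
sum(is.na(df$Shoes))  # number of students who skipped the shoes question
colSums(is.na(df))    # number of missing values in each column at once
```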
Data values in a data frame are stored in rows and columns and are accessed with bracket notation [row, column] where row is the row number and column is the column number like a matrix.
For example here you specify the data value in the 10th row and 2nd column (Height column) of the df data frame.
df[10, 2]## [1] 65
By specifying only the row index and leaving the column index blank you get all values in that row which corresponds to all the responses given by the 10th student.
df[10, ]## Student Height Gender Shoes Number Dvds ToSleep WakeUp Haircut Job Drink
## 10 10 65 male 10 7 22 2.5 8.5 12 0 milk
Drink preference was one of the questions. Responses across all students are available in the column labeled Drink as a vector of character values. You list all the different drink preferences by typing
df$Drink## [1] water pop milk water pop water water pop water milk milk water
## [13] pop milk pop water water pop water water water water water milk
## [25] pop water water pop water water water water pop water water water
## [37] pop milk pop water water water pop milk water water water pop
## [49] pop water milk pop pop water water pop milk pop pop water
## [61] water water water water water milk pop pop pop water water water
## [73] pop water pop pop water pop pop milk water pop water water
## [85] milk pop water water pop water water water milk water pop water
## [97] pop pop pop water water pop water pop milk milk water water
## [109] water water water pop water milk milk milk water milk pop water
## [121] pop pop pop pop water water water water water water milk water
## [133] pop milk water water water water water <NA> pop water water pop
## [145] milk milk water water pop water water water pop water <NA> water
## [157] water water water water milk milk water milk water water milk water
## [169] pop pop pop water pop pop water water milk milk water water
## [181] water pop pop water water pop pop water water milk water water
## [193] milk <NA> water pop milk pop milk water water water water water
## [205] water pop pop water milk water milk water milk water milk water
## [217] milk water pop water water milk water water pop milk milk water
## [229] milk water pop pop pop water water milk pop milk water milk
## [241] water water pop water water water pop pop water water pop water
## [253] water milk water pop water pop milk milk pop pop water water
## [265] water pop pop milk water water water water milk milk water water
## [277] milk milk milk pop water water <NA> water water water pop milk
## [289] water water pop water water milk pop milk milk water water water
## [301] pop water water <NA> water water water water water pop water water
## [313] water water pop water water water milk milk pop water water water
## [325] water water pop pop milk milk water water pop pop pop pop
## [337] water milk water water pop milk pop water water water pop water
## [349] water water water water water <NA> pop pop water milk water water
## [361] milk water water pop water water water water water water pop water
## [373] water milk water water milk milk milk water water water water pop
## [385] water water pop water pop milk pop water water <NA> water water
## [397] water water milk water pop milk water water water water water milk
## [409] pop pop pop water pop milk water water milk milk pop water
## [421] milk water pop milk water water water water pop water pop pop
## [433] pop milk pop water milk pop water pop pop pop water water
## [445] water water water water pop milk water water water pop milk milk
## [457] pop pop water water milk water milk pop water water water water
## [469] pop water milk water water water water water milk milk water water
## [481] pop water water milk water milk water pop pop water water pop
## [493] pop pop milk water water pop water water water water pop water
## [505] pop milk water <NA> milk water pop water water milk water water
## [517] water water water milk water water pop water pop water milk milk
## [529] milk milk pop water pop milk <NA> milk pop water water pop
## [541] milk pop water milk water pop water pop water pop water water
## [553] pop milk water water water water <NA> water water pop pop milk
## [565] water milk pop pop water water water pop pop pop pop water
## [577] water water water water pop pop water pop water water water water
## [589] milk water water water water pop pop water water water water water
## [601] water water pop water water <NA> milk pop water water water pop
## [613] water pop water pop water water pop pop water pop water milk
## [625] water pop pop pop water milk pop water pop water water milk
## [637] water water water water water water water pop pop pop pop water
## [649] pop water milk water water pop pop pop water
## Levels: milk pop water
Some students left that response blank and therefore the response is coded with the missing-value indicator.
The variable type depends on the question asked. For example, answers given to the question of student height result in a numeric variable, answers given to the question about drink preference result in a character (or factor) variable.
For integer, character, and factor variables we summarize the set of responses with the table() function.
table(df$Drink)##
## milk pop water
## 113 178 355
There are 113 students who prefer milk, 178 who prefer pop, and 355 who prefer water.
We use the plot() method to make a draft plot of this table.
plot(x = df$Drink)
Notice that the sum of the responses is 646, which is less than the total number of students (657).
Students who left that question blank are ignored in the table() function. To include the missing values you add the argument useNA = "ifany" to the table() function.
table(df$Drink,
useNA = "ifany")##
## milk pop water <NA>
## 113 178 355 11
Note: When you want code executed directly within the text you separate the code using single back ticks. This is useful when you write reports that need periodic updates as new data become available. If instead you hard-code the values in the text, you need to search the document for those values during each update.
Suppose you are interested in examining how long students reported sleeping. This was not asked directly. You compute it from the ToSleep and WakeUp times columns. You assign the result of the difference to a column we call SleepHrs.
df$SleepHrs <- df$WakeUp - df$ToSleep
head(df)## Student Height Gender Shoes Number Dvds ToSleep WakeUp Haircut Job Drink
## 1 1 67 female 10 5 10 -2.5 5.5 60 30.0 water
## 2 2 64 female 20 7 5 1.5 8.0 0 20.0 pop
## 3 3 61 female 12 2 6 -1.5 7.5 48 0.0 milk
## 4 4 61 female 3 6 40 2.0 8.5 10 0.0 water
## 5 5 70 male 4 5 6 0.0 9.0 15 17.5 pop
## 6 6 63 female NA 3 5 1.0 8.5 25 0.0 water
## SleepHrs
## 1 8.0
## 2 6.5
## 3 9.0
## 4 6.5
## 5 9.0
## 6 7.5
Now you have a new numeric variable in the data frame called SleepHrs.
You can’t table numeric variables, but the summary() method prints a set of summary statistics for the set of values.
summary(df$SleepHrs)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.500 6.500 7.500 7.385 8.500 12.500 4
The average number of hours slept is 7.4 with a maximum of 12.5 and a minimum of 2.5. There are four students who did not answer either the question about when they went to sleep or the one about when they woke up.
You use the hist() function to construct a histogram of sleep hours.
hist(x = df$SleepHrs)
The histogram function divides the number of sleep hours into one-hour bins and counts the number of students whose reported sleep hours fall into each bin. For example, based on when they said they went to sleep and when they said they woke up, about 100 students slept between five and six hours the night before the survey.
Since the gender of each student is reported, you can make comparisons between those who identify as male and those who identify as female. For instance, do men sleep more than women? You can answer this question graphically with box plots using the plot() method. You specify the character variable on the horizontal axis (x) to be gender with the x = argument and the numeric variable on the vertical axis (y) with the y = argument.
plot(x = df$Gender,
y = df$SleepHrs)
The plot reveals little difference in the amount of sleep.
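To back up the graphical impression with a number, one option is a two-sample t test on sleep hours grouped by gender (a sketch; hypothesis testing is treated more carefully later in the course):

```r
# Test whether mean sleep hours differ between the two gender groups
t.test(SleepHrs ~ Gender, data = df)
```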
Repeat for hair cut prices.
plot(x = df$Gender,
y = df$Haircut)
Big difference.
Finally, is the amount of sleep for a student related to when they go to bed? If you place numeric variables on the x and y axes then you get a scatter plot.
plot(x = df$ToSleep,
y = df$SleepHrs)
The ToSleep variable is centered on midnight so that -2 means a student went to sleep at 10 p.m.
You describe the decreasing relationship with a line through the points. The least-squares line is fit using the lm() function and the line is drawn on the existing plot with the abline() function applied to the linear regression object model.
model <- lm(SleepHrs ~ ToSleep,
data = df)
plot(x = df$ToSleep,
y = df$SleepHrs)
abline(model)
Tornadoes
Most of the time you will start by getting your data stored in a file into R. Secondary source data should be imported directly from repositories on the Web. When there is no API (application programming interface) to the repository, you need to first download the data.
For example, consider the regularly updated reports of tornadoes in the United States. The data repository is the Storm Prediction Center (SPC) https://www.spc.noaa.gov/wcm/index.html#data.
Here you are interested in the file called 1950-2019_actual_tornadoes.csv. First you download the file from the site with the download.file() function, specifying the location (url =) and the name you want the file to have on your computer (destfile =).
download.file(url = "http://www.spc.noaa.gov/wcm/data/1950-2019_actual_tornadoes.csv",
destfile = here::here("data", "Tornadoes.csv"))
A file called Tornadoes.csv should now be located in the directory data. Click on the Files tab in the lower-right panel, then select the data folder.
Next you read (import) the file as a data frame using the readr::read_csv() function from the {tidyverse} group of packages.
Torn.df <- readr::read_csv(file = here::here("data", "Tornadoes.csv"))## Rows: 65162 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): mo, dy, st, stf
## dbl (23): om, yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, el...
## date (1): date
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You preview the data frame using the head() function.
head(Torn.df)## # A tibble: 6 × 29
## om yr mo dy date time tz st stf stn mag inj
## <dbl> <dbl> <chr> <chr> <date> <time> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1950 01 03 1950-01-03 11:00 3 MO 29 1 3 3
## 2 2 1950 01 03 1950-01-03 11:55 3 IL 17 2 3 3
## 3 3 1950 01 03 1950-01-03 16:00 3 OH 39 1 1 1
## 4 4 1950 01 13 1950-01-13 05:25 3 AR 5 1 3 1
## 5 5 1950 01 25 1950-01-25 19:30 3 MO 29 2 2 5
## 6 6 1950 01 25 1950-01-25 21:00 3 IL 17 3 2 0
## # … with 17 more variables: fat <dbl>, loss <dbl>, closs <dbl>, slat <dbl>,
## # slon <dbl>, elat <dbl>, elon <dbl>, len <dbl>, wid <dbl>, ns <dbl>,
## # sn <dbl>, sg <dbl>, f1 <dbl>, f2 <dbl>, f3 <dbl>, f4 <dbl>, fc <dbl>
Each row is a unique tornado report. Observations for each report include the day and time, the state (st), the maximum EF rating (mag), the number of injuries (inj), the number of fatalities (fat), estimated property losses (loss), estimated crop losses (closs), start and end locations in decimal degrees longitude and latitude, length of the damage path in miles (len), width of the damage in yards (wid).
The total number of tornado reports in the data set is returned using the nrow() function.
nrow(Torn.df)## [1] 65162
To create a subset of the data frame that contains only tornadoes in years (yr) since 2001, you include the logical operator yr >= 2001 inside the subset operator. The logical operator is placed in front of the comma since you want all rows where the result of the operator returns a value TRUE.
Torn2.df <- Torn.df[Torn.df$yr >= 2001, ]
You see that there are fewer rows (tornado reports) in this new data frame assigned the object name Torn2.df.
You subset again, keeping only tornadoes with EF ratings (mag variable) greater than zero. Here you recycle the name Torn2.df.
Torn2.df <- Torn2.df[Torn2.df$mag > 0, ]
Now you compute the correlation between EF rating (mag) and path length (len) with the cor() function. The first argument is the vector of EF ratings and the second argument is the vector of path lengths.
cor(Torn2.df$mag, Torn2.df$len)## [1] 0.4857969
Path length is recorded in miles, path width in yards, and the EF damage rating variable mag is numeric. To convert path length to kilometers and path width to meters, and to turn the EF rating into a factor, add these changes as new columns by typing
Torn2.df$Length <- Torn2.df$len * 1609.34
Torn2.df$Width <- Torn2.df$wid * .9144
Torn2.df$EF <- factor(Torn2.df$mag)
Create side-by-side box plots of path length (in kilometers) by EF rating.
plot(x = Torn2.df$EF,
y = Torn2.df$Length/1000)
Hurricane data
Here you import the data directly from the Web by specifying the URL as a character string using the file = argument.
loc <- "http://myweb.fsu.edu/jelsner/temp/data/US.txt"
USHur.df <- readr::read_table(file = loc)##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## All = col_double(),
## MUS = col_double(),
## G = col_double(),
## FL = col_double(),
## E = col_double()
## )
The dim() function returns the size of the data frame defined as the number of rows and the number of columns.
dim(USHur.df)## [1] 166 6
There are 166 rows and 6 columns in the data frame. Each row is a year and the columns include Year, number of hurricanes (All), number of major hurricanes (MUS), number of Gulf coast hurricanes (G), number of Florida hurricanes (FL), and number of East coast hurricanes (E) in that order.
To get a glimpse of the data values you list the first six lines of the data frame using the head() function.
head(USHur.df)## # A tibble: 6 × 6
## Year All MUS G FL E
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1851 1 1 0 1 0
## 2 1852 3 1 1 2 0
## 3 1853 0 0 0 0 0
## 4 1854 2 1 1 0 1
## 5 1855 1 1 1 0 0
## 6 1856 2 1 1 1 0
The distribution of Florida hurricane counts by year is obtained using the table() function and specifying the FL column with USHur.df$FL.
table(USHur.df$FL)##
## 0 1 2 3 4
## 93 43 24 5 1
There are 93 years without a FL hurricane, 43 years with exactly one hurricane, 24 years with two hurricanes, and so on.
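As an aside, you can check how closely these counts follow a Poisson distribution by comparing the observed table to the expected number of years under a Poisson model with the same mean (the counts below are copied from the table above):

```r
counts <- c(93, 43, 24, 5, 1)              # years with 0, 1, 2, 3, 4 FL hurricanes
lambda <- sum(0:4 * counts) / sum(counts)  # mean annual count
round(dpois(0:4, lambda) * sum(counts), 1) # expected years under a Poisson model
```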
Rainfall data
The data are monthly statewide average rainfall (in inches) for Florida starting in 1895 from http://www.esrl.noaa.gov/psd/data/timeseries/. Note: I put values into a text editor and then uploaded the file to the Web at location http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt.
To import the data you use the readr::read_table() function and assign the object the name FLp.df. You type the name of the object to see that it is a tabular data frame (tibble) with 117 rows and 13 columns.
loc <- "http://myweb.fsu.edu/jelsner/temp/data/FLprecip.txt"
FLp.df <- readr::read_table(file = loc)##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Year = col_double(),
## Jan = col_double(),
## Feb = col_double(),
## Mar = col_double(),
## Apr = col_double(),
## May = col_double(),
## Jun = col_double(),
## Jul = col_double(),
## Aug = col_double(),
## Sep = col_double(),
## Oct = col_double(),
## Nov = col_double(),
## Dec = col_double()
## )
FLp.df## # A tibble: 117 × 13
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1895 3.28 3.24 2.50 4.53 4.25 4.5 7.45 6.10 4.67 3.09 2.65 1.59
## 2 1896 3.93 3.02 2.57 0.498 2.7 11.2 8.22 5.89 4.35 2.96 3.52 2.07
## 3 1897 1.84 6 2.12 4.39 2.28 5.22 7.21 6.83 11.1 4.10 1.75 2.68
## 4 1898 0.704 2.01 1.26 1.32 1.51 3.29 8.95 13.1 5.23 5.88 2.19 3.89
## 5 1899 4.52 5.92 1.90 3.40 1.11 5.80 9.26 6.71 5.13 5.88 0.751 1.94
## 6 1900 3.21 4.37 6.8 4.32 3.89 9.99 7.50 4.49 4.93 5.23 1.22 4.29
## 7 1901 2.34 4.21 5.37 2.14 4.15 10.4 6.42 10.9 8.33 1.71 0.841 2.49
## 8 1902 0.633 4.81 4.29 1.38 2.36 6.22 5.24 4.80 9.54 5.21 3.02 3.52
## 9 1903 5.06 5.58 5.45 0.429 4.74 7.01 6.63 6.96 7.47 1.75 2.7 1.70
## 10 1904 4.96 3.02 1.59 1.66 2.49 6.59 6.27 7.53 4.5 4.41 2.87 1.84
## # … with 107 more rows
The first column is the year and the next 12 columns are the months.
What was the statewide rainfall during June of 1900?
FLp.df$Year == 1900## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FLp.df$Jun[FLp.df$Year == 1900]## [1] 9.993
What year had the wettest March?
FLp.df$Mar## [1] 2.499 2.570 2.125 1.259 1.898 6.800 5.370 4.291 5.451 1.591 3.849 3.191
## [13] 0.562 0.779 2.792 1.899 2.180 3.932 5.553 1.528 2.598 0.889 2.027 2.497
## [25] 5.409 1.388 1.981 2.422 2.181 5.969 1.858 4.329 2.400 4.392 3.374 7.449
## [37] 5.312 3.659 3.898 3.363 0.960 3.103 4.257 1.764 1.407 3.515 3.918 6.123
## [49] 4.441 5.685 0.637 4.152 7.133 6.822 2.043 4.018 3.293 3.852 3.090 2.404
## [61] 1.643 1.325 4.601 6.416 8.701 6.357 2.489 3.808 1.707 3.237 4.042 1.826
## [73] 1.193 1.569 5.991 8.388 2.142 4.494 5.516 2.525 2.353 2.553 2.002 4.226
## [85] 2.143 5.043 3.176 5.379 7.213 4.710 2.537 4.297 8.443 5.101 3.349 2.672
## [97] 7.097 3.299 5.097 3.839 3.395 7.575 2.754 6.042 1.790 3.207 6.824 2.700
## [109] 6.642 0.994 6.027 0.496 1.213 3.568 2.662 5.995 4.063
max(FLp.df$Mar)## [1] 8.701
which.max(FLp.df$Mar)## [1] 65
FLp.df$Year[which.max(FLp.df$Mar)]## [1] 1959
What month during 1965 was the wettest? How wet was it?
FLp.df$Year == 1965## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FLp.df[FLp.df$Year == 1965, ]## # A tibble: 1 × 13
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1965 1.80 4.58 4.04 2.54 1.08 10.0 8.54 7.14 6.69 4.66 1.58 2.76
which.max(FLp.df[FLp.df$Year == 1965, 2:12])## Jun
## 6
max(FLp.df[FLp.df$Year == 1965, 2:12])## [1] 10.032
Using functions from the {dplyr} package
The functions in the {dplyr} package simplify working with data frames. The functions work only on data frames.
The function names are English language verbs so they are easy to remember. The verbs help you to translate your thoughts into code.
We consider the verbs one at a time using the airquality data frame. The data frame contains air quality measurements taken in New York City between May and September 1973. (?airquality).
dim(airquality)## [1] 153 6
head(airquality)## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
The columns include Ozone (ozone concentration in ppb), Solar.R (solar radiation in langleys), Wind (wind speed in mph), Temp (air temperature in degrees F), Month, and Day.
We get summary statistics on the values in each column with the summary() method.
summary(airquality)## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
Note that for columns containing missing values, the number of NA's is tabulated. For example, there are 37 missing ozone measurements and 7 missing radiation measurements.
Importantly for making your code more human readable you can apply the summary() function on the airquality data frame using the pipe operator (|>).
airquality |> summary()## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
You read the pipe as THEN: “take the airquality data frame THEN summarize the columns”.
The pipe operator allows you to string together functions that when read by a human makes it easy to understand what is being done.
Hypothetically, suppose the object of interest is called me and there was a function called wake_up(). I could apply this function called wake_up() in two ways.
wake_up(me) # way number one
me |> wake_up() # way number two
The second way involves a bit more typing, but it is easier to read (the subject comes before the predicate) and thus easier to understand. This becomes clear when stringing together functions.
Continuing with the hypothetical example, what happens to the result of me after the function wake_up() has been applied? I get_out_of_bed() and then get_dressed().
Again, you can apply these functions in two ways.
get_dressed(get_out_of_bed(wake_up(me)))
me |>
wake_up() |>
get_out_of_bed() |>
get_dressed()
The order of the functions usually matters to the outcome.
Note that I format the code to make it easy to read. Each line gets only one verb and each line ends with the pipe (except the last one).
Continuing…
me |>
wake_up() |>
get_out_of_bed() |>
get_dressed() |>
make_coffee() |>
drink_coffee() |>
leave_house()
Which is much better in terms of ‘readability’ than leave_house(drink_coffee(make_coffee(get_dressed(get_out_of_bed(wake_up(me)))))).
Tibbles are data frames that make life a little easier. R is an old language, and some things that were useful 10 or 20 years ago now get in your way. To make a data frame a tibble (tabular data frame) use the as_tibble() function.
class(airquality)## [1] "data.frame"
airquality <- dplyr::as_tibble(airquality)
class(airquality)## [1] "tbl_df" "tbl" "data.frame"
Click on airquality in the environment. It is a data frame. We will use the terms ‘tibble’ and ‘data frame’ interchangeably in this class.
Now you are ready to look at some of the commonly used verbs and to see how to apply them to a data frame.
The function select() chooses variables by name. For example, choose the month (Month), day (Day), and temperature (Temp) columns.
airquality |>
dplyr::select(Month, Day, Temp)## # A tibble: 153 × 3
## Month Day Temp
## <int> <int> <int>
## 1 5 1 67
## 2 5 2 72
## 3 5 3 74
## 4 5 4 62
## 5 5 5 56
## 6 5 6 66
## 7 5 7 65
## 8 5 8 59
## 9 5 9 61
## 10 5 10 69
## # … with 143 more rows
The result is a data frame containing only the three columns with column names listed in the select() function.
Suppose you want a new data frame with only the temperature and ozone concentrations. You include an assignment operator (<-) and an object name (here df).
df <- airquality |>
dplyr::select(Temp, Ozone)
df## # A tibble: 153 × 2
## Temp Ozone
## <int> <int>
## 1 67 41
## 2 72 36
## 3 74 12
## 4 62 18
## 5 56 NA
## 6 66 28
## 7 65 23
## 8 59 19
## 9 61 8
## 10 69 NA
## # … with 143 more rows
The verbs take only data frames as input and return only data frames.
The function filter() chooses observations based on specific values. Suppose we want only the observations where the temperature is at or above 80 F.
airquality |>
dplyr::filter(Temp >= 80)## # A tibble: 73 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 45 252 14.9 81 5 29
## 2 NA 186 9.2 84 6 4
## 3 NA 220 8.6 85 6 5
## 4 29 127 9.7 82 6 7
## 5 NA 273 6.9 87 6 8
## 6 71 291 13.8 90 6 9
## 7 39 323 11.5 87 6 10
## 8 NA 259 10.9 93 6 11
## 9 NA 250 9.2 92 6 12
## 10 23 148 8 82 6 13
## # … with 63 more rows
The result is a data frame with the same 6 columns but now only 73 observations. Each of the observations has a temperature of at least 80 F.
Suppose you want a new data frame keeping only observations when temperature is at least 80 F and when winds are less than 5 mph.
df <- airquality |>
dplyr::filter(Temp >= 80 & Wind < 5)
df## # A tibble: 8 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 135 269 4.1 84 7 1
## 2 64 175 4.6 83 7 5
## 3 66 NA 4.6 87 8 6
## 4 122 255 4 89 8 7
## 5 168 238 3.4 81 8 25
## 6 118 225 2.3 94 8 29
## 7 73 183 2.8 93 9 3
## 8 91 189 4.6 93 9 4
The function arrange() orders the rows by values given in a particular column.
airquality |>
dplyr::arrange(Solar.R)## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 16 7 6.9 74 7 21
## 2 1 8 9.7 59 5 21
## 3 23 13 12 67 5 28
## 4 23 14 9.2 71 9 22
## 5 8 19 20.1 61 5 9
## 6 14 20 16.6 63 9 25
## 7 9 24 13.8 81 8 2
## 8 9 24 10.9 71 9 14
## 9 4 25 9.7 61 5 23
## 10 13 27 10.3 76 9 18
## # … with 143 more rows
The ordering is done from the lowest value of radiation to highest value. Here you see the first 10 rows. Note Month and Day are no longer chronological.
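By default arrange() sorts from lowest to highest. To sort from highest to lowest, wrap the column name in desc(). A quick sketch, again using the airquality data frame:

```r
# Arrange the built-in airquality data frame by radiation, highest value first.
# Rows with a missing Solar.R value are placed at the bottom.
df <- airquality |>
  dplyr::arrange(dplyr::desc(Solar.R))
head(df)
```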
Repeat but order by the value of air temperature.
airquality |>
dplyr::arrange(Temp)## # A tibble: 153 × 6
## Ozone Solar.R Wind Temp Month Day
## <int> <int> <dbl> <int> <int> <int>
## 1 NA NA 14.3 56 5 5
## 2 6 78 18.4 57 5 18
## 3 NA 66 16.6 57 5 25
## 4 NA NA 8 57 5 27
## 5 18 65 13.2 58 5 15
## 6 NA 266 14.9 58 5 26
## 7 19 99 13.8 59 5 8
## 8 1 8 9.7 59 5 21
## 9 8 19 20.1 61 5 9
## 10 4 25 9.7 61 5 23
## # … with 143 more rows
Importantly you can string the functions together. For example select the variables radiation, wind, and temperature then filter by temperatures above 90 F and arrange by temperature.
airquality |>
dplyr::select(Solar.R, Wind, Temp) |>
dplyr::filter(Temp > 90) |>
dplyr::arrange(Temp)## # A tibble: 14 × 3
## Solar.R Wind Temp
## <int> <dbl> <int>
## 1 291 14.9 91
## 2 167 6.9 91
## 3 250 9.2 92
## 4 267 6.3 92
## 5 272 5.7 92
## 6 222 8.6 92
## 7 197 5.1 92
## 8 259 10.9 93
## 9 183 2.8 93
## 10 189 4.6 93
## 11 225 2.3 94
## 12 188 6.3 94
## 13 237 6.3 96
## 14 203 9.7 97
The result is a data frame with three columns and 14 rows arranged by increasing temperatures above 90 F.
The mutate() function adds new columns to the data frame. For example, create a new column called TempC as the temperature in degrees Celsius. Also create a column called WindMS as the wind speed in meters per second.
airquality |>
dplyr::mutate(TempC = (Temp - 32) * 5/9,
WindMS = Wind * .44704) ## # A tibble: 153 × 8
## Ozone Solar.R Wind Temp Month Day TempC WindMS
## <int> <int> <dbl> <int> <int> <int> <dbl> <dbl>
## 1 41 190 7.4 67 5 1 19.4 3.31
## 2 36 118 8 72 5 2 22.2 3.58
## 3 12 149 12.6 74 5 3 23.3 5.63
## 4 18 313 11.5 62 5 4 16.7 5.14
## 5 NA NA 14.3 56 5 5 13.3 6.39
## 6 28 NA 14.9 66 5 6 18.9 6.66
## 7 23 299 8.6 65 5 7 18.3 3.84
## 8 19 99 13.8 59 5 8 15 6.17
## 9 8 19 20.1 61 5 9 16.1 8.99
## 10 NA 194 8.6 69 5 10 20.6 3.84
## # … with 143 more rows
The resulting data frame has 8 columns (two new ones) labeled TempC and WindMS.
On days when the temperature is below 60 F add a column giving the apparent temperature based on the cooling effect of the wind (wind chill) and then arrange from coldest to warmest apparent temperature.
airquality |>
dplyr::filter(Temp < 60) |>
dplyr::mutate(TempAp = 35.74 + .6215 * Temp - 35.75 * Wind^.16 + .4275 * Temp * Wind^.16) |>
dplyr::arrange(TempAp)## # A tibble: 8 × 7
## Ozone Solar.R Wind Temp Month Day TempAp
## <int> <int> <dbl> <int> <int> <int> <dbl>
## 1 NA NA 14.3 56 5 5 52.5
## 2 6 78 18.4 57 5 18 53.0
## 3 NA 66 16.6 57 5 25 53.3
## 4 NA 266 14.9 58 5 26 54.9
## 5 18 65 13.2 58 5 15 55.2
## 6 NA NA 8 57 5 27 55.3
## 7 19 99 13.8 59 5 8 56.4
## 8 1 8 9.7 59 5 21 57.3
The summarize() function reduces the data frame based on a function that computes a statistic. For example, to compute the average wind speed during July or the average temperature during June, type
airquality |>
dplyr::filter(Month == 7) |>
dplyr::summarize(Wavg = mean(Wind))## # A tibble: 1 × 1
## Wavg
## <dbl>
## 1 8.94
airquality |>
dplyr::filter(Month == 6) |>
dplyr::summarize(Tavg = mean(Temp))## # A tibble: 1 × 1
## Tavg
## <dbl>
## 1 79.1
We’ve seen functions that compute statistics including sum(), sd(), min(), max(), var(), range(), median(). Others include:
| Summary function | Description |
|---|---|
| dplyr::n() | Length of the column |
| dplyr::first() | First value of the column |
| dplyr::last() | Last value of the column |
| dplyr::n_distinct() | Number of distinct values |
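As a quick sketch applying these to the airquality data frame, count the rows, the number of distinct months, and get the first and last temperature values:

```r
res <- airquality |>
  dplyr::summarize(NumRows = dplyr::n(),                 # number of observations
                   NumMonths = dplyr::n_distinct(Month), # distinct months (May-September)
                   FirstTemp = dplyr::first(Temp),       # temperature in the first row
                   LastTemp = dplyr::last(Temp))         # temperature in the last row
res
```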
Find the maximum and median wind speed and maximum ozone concentration values during the month of May. Also determine the number of observations during May.
airquality |>
dplyr::filter(Month == 5) |>
dplyr::summarize(Wmax = max(Wind),
Wmed = median(Wind),
OzoneMax = max(Ozone),
NumDays = dplyr::n())## # A tibble: 1 × 4
## Wmax Wmed OzoneMax NumDays
## <dbl> <dbl> <int> <int>
## 1 20.1 11.5 NA 31
The result gives an NA for the maximum value of ozone (OzoneMax) because there is at least one missing value in the Ozone column. You fix this with the na.rm = TRUE argument in the function max().
airquality |>
dplyr::filter(Month == 5) |>
dplyr::summarize(Wmax = max(Wind),
Wmed = median(Wind),
OzoneMax = max(Ozone, na.rm = TRUE),
NumDays = dplyr::n())## # A tibble: 1 × 4
## Wmax Wmed OzoneMax NumDays
## <dbl> <dbl> <int> <int>
## 1 20.1 11.5 115 31
If you want to summarize separately for each month you use the group_by() function. You split the data frame by some variable (e.g., Month), apply a function to the individual data frames, and then combine the output.
Find the highest ozone concentration by month. Include the number of observations (days) in the month.
airquality |>
dplyr::group_by(Month) |>
dplyr::summarize(OzoneMax = max(Ozone, na.rm = TRUE),
NumDays = dplyr::n())## # A tibble: 5 × 3
## Month OzoneMax NumDays
## <int> <int> <int>
## 1 5 115 31
## 2 6 71 30
## 3 7 135 31
## 4 8 168 31
## 5 9 96 30
Find the average ozone concentration when temperatures are above and below 70 F. Include the number of observations (days) in the two groups.
airquality |>
dplyr::group_by(Temp >= 70) |>
dplyr::summarize(OzoneAvg = mean(Ozone, na.rm = TRUE),
NumDays = dplyr::n())## # A tibble: 2 × 3
## `Temp >= 70` OzoneAvg NumDays
## <lgl> <dbl> <int>
## 1 FALSE 18.0 32
## 2 TRUE 49.1 121
On average, ozone concentration is higher on warm days (Temp >= 70 F). Said another way: mean ozone concentration statistically depends on temperature.
The mean is a model for the data. The statistical dependency of the mean implies that a model for ozone concentration will be improved by including temperature as an explanatory variable.
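As a sketch of that idea (not part of the original analysis), you can fit the simple linear regression of ozone on temperature with the base lm() function; the positive slope on Temp quantifies the dependency.

```r
# Regress ozone concentration on temperature. Rows with a missing value
# in either column are dropped automatically by lm().
fit <- lm(Ozone ~ Temp, data = airquality)
coef(fit)  # the Temp coefficient is the change in ozone (ppb) per degree F
```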
In summary, the important verbs are
| Verb | Description |
|---|---|
| select() | selects columns; pick variables by their names |
| filter() | filters rows; pick observations by their values |
| arrange() | re-orders the rows |
| mutate() | creates new columns; create new variables with functions of existing variables |
| summarize() | summarizes values; collapse many values down to a single summary |
| group_by() | allows operations to be grouped |
The six functions form the basis of a grammar for data. You can only alter a data frame by reordering the rows (arrange()), picking observations and variables of interest (filter() and select()), adding new variables that are functions of existing variables (mutate()), collapsing many values to a summary (summarize()), and conditioning on variables (group_by()).
The syntax of the functions is all the same:
- The first argument is a data frame. This argument is implicit when using the |> operator.
- The subsequent arguments describe what to do with the data frame. You refer to columns in the data frame directly (without using $).
- The result is a new data frame.
These properties make it easy to chain together many simple lines of code to do complex data manipulations and summaries all while making it easy to read by humans.
Thursday September 1, 2022
“Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.” — Hadley Wickham
Today
- Making graphs
Working with data frames is part of the iterative cycle of data science, along with visualizing, and modeling. The iterative cycle of data science:
- Generate questions about our data.
- Look for answers by visualizing and modeling the data after the data are in suitably arranged data frames.
- Use what we learn to refine our questions and/or ask new ones.
Questions are tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of the data set and helps you decide what to do.
For additional practice working with data frames using functions from the {tidyverse} set of packages:
- See http://r4ds.had.co.nz/index.html
- Cheat sheets http://rstudio.com/cheatsheets
Before moving on, let’s consider another data frame. The data frame contains observations on Palmer penguins and is available from https://education.rstudio.com/blog/2020/07/palmerpenguins-cran/.
You import the data frame using the read_csv() function.
loc <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
penguins <- readr::read_csv(file = loc)## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
penguins## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>
The observations are 344 individual penguins, each described by species (Adelie, Chinstrap, Gentoo), where it was found (island name), bill length (mm), bill depth (mm), flipper length (mm), body mass (g), sex (male or female), and year.
Each penguin belongs to one of three species. To see how many of the 344 penguins are in each species you use the table() function.
table(penguins$species)##
## Adelie Chinstrap Gentoo
## 152 68 124
There are 152 Adelie, 68 Chinstrap, and 124 Gentoo penguins.
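The table() function works on a vector pulled out with $. Within the data-frame grammar, the equivalent is dplyr::count(), which returns a tibble; applied here it would be dplyr::count(penguins, species). A self-contained sketch on the airquality data frame, counting days per month:

```r
# Count rows by the value of Month (May = 5 through September = 9)
res <- airquality |>
  dplyr::count(Month)
res
```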
To create a data frame that includes only the female penguins you type
( df <- penguins |>
dplyr::filter(sex == "female") )## # A tibble: 165 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.5 17.4 186 3800
## 2 Adelie Torgersen 40.3 18 195 3250
## 3 Adelie Torgersen 36.7 19.3 193 3450
## 4 Adelie Torgersen 38.9 17.8 181 3625
## 5 Adelie Torgersen 41.1 17.6 182 3200
## 6 Adelie Torgersen 36.6 17.8 185 3700
## 7 Adelie Torgersen 38.7 19 195 3450
## 8 Adelie Torgersen 34.4 18.4 184 3325
## 9 Adelie Biscoe 37.8 18.3 174 3400
## 10 Adelie Biscoe 35.9 19.2 189 3800
## # … with 155 more rows, and 2 more variables: sex <chr>, year <dbl>
To create a data frame that includes only penguins that are not of species Adelie you type
( df <- penguins |>
dplyr::filter(species != "Adelie") )## # A tibble: 192 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gentoo Biscoe 46.1 13.2 211 4500
## 2 Gentoo Biscoe 50 16.3 230 5700
## 3 Gentoo Biscoe 48.7 14.1 210 4450
## 4 Gentoo Biscoe 50 15.2 218 5700
## 5 Gentoo Biscoe 47.6 14.5 215 5400
## 6 Gentoo Biscoe 46.5 13.5 210 4550
## 7 Gentoo Biscoe 45.4 14.6 211 4800
## 8 Gentoo Biscoe 46.7 15.3 219 5200
## 9 Gentoo Biscoe 43.3 13.4 209 4400
## 10 Gentoo Biscoe 46.8 15.4 215 5150
## # … with 182 more rows, and 2 more variables: sex <chr>, year <dbl>
To create a data frame containing only penguins that weigh more than 6000 grams you type
( df <- penguins |>
dplyr::filter(body_mass_g > 6000) )## # A tibble: 2 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Gentoo Biscoe 49.2 15.2 221 6300 male
## 2 Gentoo Biscoe 59.6 17 230 6050 male
## # … with 1 more variable: year <dbl>
To create a data frame with female penguins that have flippers longer than 220 mm we type
( df <- penguins |>
dplyr::filter(flipper_length_mm > 220 &
sex == "female") )## # A tibble: 1 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Gentoo Biscoe 46.9 14.6 222 4875 fema…
## # … with 1 more variable: year <dbl>
To create a data frame containing rows where the bill length value is NOT missing.
( df <- penguins |>
dplyr::filter(!is.na(bill_length_mm)) )## # A tibble: 342 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 34.1 18.1 193 3475
## 9 Adelie Torgersen 42 20.2 190 4250
## 10 Adelie Torgersen 37.8 17.1 186 3300
## # … with 332 more rows, and 2 more variables: sex <chr>, year <dbl>
Note that this filtering keeps rows that have missing values in other columns, but there will be no rows where the bill_length_mm value is NA.
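If instead you want to keep only rows with no missing values in any column, the drop_na() function from the companion {tidyr} package (or base R's na.omit()) does that in one step. A self-contained sketch on the airquality data frame:

```r
# Keep only complete rows; airquality has NAs in the Ozone and Solar.R columns
res <- airquality |>
  tidyr::drop_na()
nrow(res)  # 111 of the 153 rows are complete
```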
Finally, to compute the average bill length for each species.
penguins |>
dplyr::group_by(species) |>
dplyr::summarize(AvgBL = mean(bill_length_mm, na.rm = TRUE))## # A tibble: 3 × 2
## species AvgBL
## <chr> <dbl>
## 1 Adelie 38.8
## 2 Chinstrap 48.8
## 3 Gentoo 47.5
Making graphs
The {ggplot2} package is a popular graphics tool among data scientists (e.g., New York Times and 538). Functionality is built on principles of good data visualization.
- Mapping data to aesthetics
- Layering
- Building plots step by step
You make the functions available to your current working directory by typing
library(ggplot2)
Map data to aesthetics
Consider the following numeric vectors (foo, bar and zaz). Create a data frame df using the data.frame() function.
foo <- c(-122.419416,-121.886329,-71.05888,-74.005941,-118.243685,-117.161084,-0.127758,-77.036871,
116.407395,-122.332071,-87.629798,-79.383184,-97.743061,121.473701,72.877656,2.352222,
77.594563,-75.165222,-112.074037,37.6173)
bar <- c(37.77493,37.338208,42.360083,40.712784,34.052234,32.715738,51.507351,38.907192,39.904211,
47.60621,41.878114,43.653226,30.267153,31.230416,19.075984,48.856614,12.971599,39.952584,
33.448377,55.755826)
zaz <- c(6471,4175,3144,2106,1450,1410,842,835,758,727,688,628,626,510,497,449,419,413,325,318)
df <- data.frame(foo, bar, zaz)
head(df)## foo bar zaz
## 1 -122.41942 37.77493 6471
## 2 -121.88633 37.33821 4175
## 3 -71.05888 42.36008 3144
## 4 -74.00594 40.71278 2106
## 5 -118.24368 34.05223 1450
## 6 -117.16108 32.71574 1410
To make a scatter plot you use the ggplot() function. Note that the package name is {ggplot2} but the function is ggplot() (without the 2).
Inside the ggplot() function you specify the data frame with the data = argument. You also specify what columns from the data frame are to be mapped to what ‘aesthetics’ with the aes() function using the mapping = argument. The aes() function is nested inside the ggplot() function or inside a layer function.
For a scatter plot the aesthetics must include the x and y coordinates at a minimum, and for this example they are in the columns labeled foo and bar respectively.
Then to render the scatter plot you include the function geom_point() as a layer with the + symbol. Numeric values are specified using the arguments x = and y = in the aes() function and are rendered as points on a plot.
ggplot(data = df,
mapping = aes(x = foo, y = bar)) +
geom_point()
You map data values to aesthetic attributes. The points in the scatter plot are geometric objects that get drawn. In {ggplot2} lingo, the points are geoms. More specifically, the points are point geoms that are denoted syntactically with the function geom_point().
All geometric objects have aesthetic attributes (aesthetics):
- x-position
- y-position
- color
- size
- transparency
You create a mapping between variables in your data frame and the aesthetic attributes of geometric objects. In the scatter plot you mapped foo to the x-position aesthetic and bar to the y-position aesthetic. This may seem trivial: foo is on the x-axis and bar is on the y-axis. You certainly can do that in Excel.
Here there is a deeper structure. Theoretically, geometric objects (i.e., the things you draw in a plot, like points) don’t just have attributes like position. They have a color, size, etc.
For example here you map a new variable to the size aesthetic.
ggplot(data = df,
mapping = aes(x = foo, y = bar)) +
geom_point(mapping = aes(size = zaz))
You changed the scatter plot to a bubble chart by mapping a new variable to the size aesthetic. Any visualization can be deconstructed into geom specifications and a mapping from data to the aesthetic attributes of the geometric objects.
Build plots in layers
The principle of layering is important. To create good visualizations you often need to:
- Plot multiple datasets, or
- Plot a dataset with additional contextual information contained in a second dataset, or
- Plot summaries or statistical transformations over the raw data
Let’s modify the bubble chart by getting additional data and plotting it as a new layer below the bubbles. First get the world map boundaries using the map_data() function from {ggplot2} (which draws on the {maps} package), specifying the name of the map (here "world") and assigning the result to a data frame with the name df2.
df2 <- map_data(map = "world") |>
dplyr::glimpse()## Rows: 99,338
## Columns: 6
## $ long <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.0…
## $ lat <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707, …
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 1…
## $ region <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Plot the new data as a new layer underneath the bubbles.
ggplot(data = df,
aes(x = foo, y = bar)) +
geom_polygon(data = df2,
mapping = aes(x = long, y = lat, group = group)) +
geom_point(mapping = aes(size = zaz), color = "red")
This is the same bubble chart but now with a new layer added. You changed the bubble chart into a new visualization called a “dot distribution map,” which is more insightful and visually interesting.
The bubble chart is a modified scatter plot and the dot distribution map is a modified bubble chart.
You used two of the data visualization principles (mapping & layering) to build this plot:
- To create the scatter plot, you mapped foo to the x-aesthetic and mapped bar to the y-aesthetic.
- To create the bubble chart, you mapped zaz to the size-aesthetic.
- To create the dot distribution map, you added a layer of polygon data under the bubbles.
Iteration (step by step)
The third principle is about process. The graphing process begins with mapping and layering but ends with iteration when you add layers that modify scales, legends, colors, etc. The syntax of ggplot layerability enables and rewards iteration.
Instead of plotting the result of the above code, assign the result to an object called p1. Copy/paste the code from above, then include the assignment operator p1 <-.
p1 <- ggplot(data = df,
mapping = aes(x = foo, y = bar)) +
geom_polygon(data = df2,
mapping = aes(x = long, y = lat, group = group)) +
geom_point(aes(size = zaz), color = "red")
Now modify the axes labels, saving the new plot to an object called p2.
( p2 <- p1 + xlab("Longitude") + ylab("Latitude") )
Next modify the scale label.
p2 + scale_size_continuous(name = "Venture Capital Investment\n(USD, Millions)\n")
Of course you can do this all together with
p1 + xlab("Longitude") +
ylab("Latitude") +
scale_size_continuous(name = "Venture Capital Investment\n(USD, Millions)\n")
The facet_wrap() function is a layer to iterate (repeat) the entire plot conditional on another variable. It is like the dplyr::group_by() function in the data grammar.
US tornadoes
Consider the tornado records in the file Tornadoes.csv. Import the data using the readr::read_csv() function then create new columns called Year, Month and EF using the dplyr::mutate() function.
( Torn.df <- readr::read_csv(file = here::here("data", "Tornadoes.csv")) |>
dplyr::mutate(Year = yr,
Month = as.integer(mo),
EF = mag) )## Rows: 65162 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): mo, dy, st, stf
## dbl (23): om, yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, el...
## date (1): date
## time (1): time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 65,162 × 32
## om yr mo dy date time tz st stf stn mag inj
## <dbl> <dbl> <chr> <chr> <date> <time> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1950 01 03 1950-01-03 11:00 3 MO 29 1 3 3
## 2 2 1950 01 03 1950-01-03 11:55 3 IL 17 2 3 3
## 3 3 1950 01 03 1950-01-03 16:00 3 OH 39 1 1 1
## 4 4 1950 01 13 1950-01-13 05:25 3 AR 5 1 3 1
## 5 5 1950 01 25 1950-01-25 19:30 3 MO 29 2 2 5
## 6 6 1950 01 25 1950-01-25 21:00 3 IL 17 3 2 0
## 7 7 1950 01 26 1950-01-26 18:00 3 TX 48 1 2 2
## 8 8 1950 02 11 1950-02-11 13:10 3 TX 48 2 2 0
## 9 9 1950 02 11 1950-02-11 13:50 3 TX 48 3 3 12
## 10 10 1950 02 11 1950-02-11 21:00 3 TX 48 4 2 5
## # … with 65,152 more rows, and 20 more variables: fat <dbl>, loss <dbl>,
## # closs <dbl>, slat <dbl>, slon <dbl>, elat <dbl>, elon <dbl>, len <dbl>,
## # wid <dbl>, ns <dbl>, sn <dbl>, sg <dbl>, f1 <dbl>, f2 <dbl>, f3 <dbl>,
## # f4 <dbl>, fc <dbl>, Year <dbl>, Month <int>, EF <dbl>
Next create a data frame (df) containing the number of tornadoes by year for the state of Kansas.
( df <- Torn.df |>
dplyr::filter(st == "KS") |>
dplyr::group_by(Year) |>
dplyr::summarize(nT = dplyr::n()) )## # A tibble: 70 × 2
## Year nT
## <dbl> <int>
## 1 1950 30
## 2 1951 77
## 3 1952 19
## 4 1953 29
## 5 1954 68
## 6 1955 96
## 7 1956 57
## 8 1957 63
## 9 1958 49
## 10 1959 65
## # … with 60 more rows
Then use the functions from the {ggplot2} package to plot the number of tornadoes by year using lines to connect the values in order of the variable on the x-axis.
ggplot(data = df,
mapping = aes(x = Year, y = nT)) +
geom_line()
Note: In the early production stage of research, I like to break the code into steps as above: (1) Import the data, (2) manipulate the data, and (3) plot the data. It is easier to document but it also introduces the potential for mistakes because of the intermediary objects in the environment (e.g., Torn.df, df).
Below you bring together the above code to create the time series of Kansas tornado frequency without producing intermediary objects.
readr::read_csv(file = here::here("data", "Tornadoes.csv")) |>
dplyr::mutate(Year = yr,
Month = as.integer(mo),
EF = mag) |>
dplyr::filter(st == "KS") |>
dplyr::group_by(Year) |>
dplyr::summarize(nT = dplyr::n()) |>
ggplot(mapping = aes(x = Year, y = nT)) +
geom_line() +
geom_point()
Recall that the group_by() function allows you to repeat an operation depending on the value (or level) of some variable. For example, to count the number of tornadoes by EF damage rating since 2007, ignoring missing ratings:
Torn.df |>
dplyr::filter(Year >= 2007, EF != -9) |>
dplyr::group_by(EF) |>
dplyr::summarize(Count = dplyr::n()) ## # A tibble: 6 × 2
## EF Count
## <dbl> <int>
## 1 0 8597
## 2 1 5180
## 3 2 1354
## 4 3 354
## 5 4 74
## 6 5 9
The result is a table listing the number of tornadoes grouped by EF rating.
Instead of printing the table, you create a bar chart using the geom_col() function.
Torn.df |>
dplyr::filter(Year >= 2007, EF != -9) |>
dplyr::group_by(EF) |>
dplyr::summarize(Count = dplyr::n()) |>
ggplot(mapping = aes(x = EF, y = Count)) +
geom_col()
The geom_bar() function counts the number of cases at each x position so you don’t need the group_by() and summarize() functions.
Torn.df |>
dplyr::filter(Year >= 2007, EF != -9) |>
ggplot(mapping = aes(x = EF)) +
geom_bar()
Improve the bar chart to make it ready for publication.
Torn.df |>
dplyr::filter(Year >= 2007, EF != -9) |>
dplyr::group_by(EF) |>
dplyr::summarize(Count = dplyr::n()) |>
ggplot(mapping = aes(x = factor(EF), y = Count, fill = Count)) +
geom_bar(stat = "identity") +
xlab("EF Rating") +
ylab("Number of Tornadoes") +
scale_fill_continuous(low = 'green', high = 'orange') +
geom_text(aes(label = Count), vjust = -.5, size = 3) +
theme_minimal() +
theme(legend.position = 'none')
You create a set of plots with the facet_wrap() function. Here you create a set of bar charts showing the frequency of tornadoes by EF rating for each year in the data set since 2004.
You add the function after the geom_bar() layer and use the formula syntax (~ Year) inside the parentheses. You interpret the syntax as “plot bar charts conditioned on the variable year.”
Torn.df |>
dplyr::filter(Year >= 2004, EF != -9) |>
ggplot(mapping = aes(x = factor(EF))) +
geom_bar() +
facet_wrap(~ Year)
Hot days in Tallahassee and Las Vegas
The data are daily weather observations from the Tallahassee International Airport.
Import the data.
loc <- "http://myweb.fsu.edu/jelsner/temp/data/TLH_Daily1940-2021.csv"
TLH.df <- readr::read_csv(file = loc)## Rows: 29997 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): STATION, NAME
## dbl (4): TAVG, TMAX, TMIN, TOBS
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables of interest are the daily high (and low) temperature in the column labeled TMAX (TMIN). The values are in degrees F.
Rename the columns then select only the date and temperature columns.
TLH.df <- TLH.df |>
dplyr::rename(TmaxF = TMAX,
TminF = TMIN,
Date = DATE) |>
dplyr::select(Date, TmaxF, TminF) |>
dplyr::glimpse()## Rows: 29,997
## Columns: 3
## $ Date <date> 1940-03-01, 1940-03-02, 1940-03-03, 1940-03-04, 1940-03-05, 194…
## $ TmaxF <dbl> 72, 77, 73, 72, 61, 66, 72, 56, 60, 72, 72, 65, 74, 63, 56, 73, …
## $ TminF <dbl> 56, 53, 56, 44, 45, 40, 36, 41, 33, 32, 37, 51, 59, 49, 37, 32, …
Q: Based on these data, is it getting hotter in Tallahassee? Let’s compute the annual average high temperature and create a time series graph.
You use the year() function from the {lubridate} package to get a column called Year. Since the data only has values through mid May 2022, you keep all observations with a Year column value less than 2022. You then use the group_by() function to group by Year, and the summarize() function to get the average daily maximum temperature for each year.
df <- TLH.df |>
dplyr::mutate(Year = lubridate::year(Date)) |>
dplyr::filter(Year < 2022) |>
dplyr::group_by(Year) |>
dplyr::summarize(AvgT = mean(TmaxF, na.rm = TRUE)) |>
dplyr::glimpse()## Rows: 82
## Columns: 2
## $ Year <dbl> 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950,…
## $ AvgT <dbl> 80.45425, 78.78904, 78.95808, 79.18356, 79.24863, 79.43014, 80.30…
You now have a data frame with two columns: Year and AvgT (annual average daily high temperature in degrees F). If a day is missing a value it is skipped when computing the average because of the na.rm = TRUE argument in the mean() function.
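As a minimal base R sketch of that behavior (the numbers here are made up):

```r
# A single NA poisons the mean unless na.rm = TRUE drops it
x <- c(90, 95, NA, 100)
mean(x)               # NA
mean(x, na.rm = TRUE) # 95
```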
Next you use functions from the {ggplot2} package to make a time series graph. You specify the x aesthetic as Year and the y aesthetic as the AvgT and include point and line layers.
ggplot(data = df,
mapping = aes(x = Year, y = AvgT)) +
geom_point(size = 3) +
geom_line() +
ylab("Average Annual Temperature in Tallahassee, FL (F)")
You can go directly to the graph without saving the resulting data frame. That is, you pipe |> the resulting data frame after applying the {dplyr} verbs to the ggplot() function. The object in the first argument of the ggplot() function is the result (data frame) from the code above. Here you also add a smooth curve through the set of averages with the geom_smooth() layer.
TLH.df |>
dplyr::mutate(Year = lubridate::year(Date)) |>
dplyr::filter(Year < 2022) |>
dplyr::group_by(Year) |>
dplyr::summarize(AvgT = mean(TmaxF, na.rm = TRUE)) |>
ggplot(mapping = aes(x = Year, y = AvgT)) +
geom_point(size = 3) +
geom_line() +
ylab("Average Annual Temperature in Tallahassee, FL (F)") +
geom_smooth() +
theme_minimal()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Q: Is the frequency of extremely hot days increasing over time? Let’s consider a daily high temperature of 100 F and above as extremely hot.
Here you count the number of days at or above 100 F using the summarize() function together with the sum() function applied to the logical condition TmaxF >= 100. If a day is missing a temperature, you remove it with the na.rm = TRUE argument in the sum() function.
TLH.df |>
dplyr::mutate(Year = lubridate::year(Date)) |>
dplyr::filter(Year < 2022) |>
dplyr::group_by(Year) |>
dplyr::summarize(N100 = sum(TmaxF >= 100, na.rm = TRUE)) |>
ggplot(mapping = aes(x = Year, y = N100, fill = N100)) +
geom_bar(stat = 'identity') +
scale_fill_continuous(low = 'orange', high = 'red') +
geom_text(aes(label = N100), vjust = 1.5, size = 3) +
scale_x_continuous(breaks = seq(1940, 2020, 10)) +
labs(title = expression(paste("Number of days in Tallahassee, Florida at or above 100", {}^o, " F")),
subtitle = "Last official 100+ day was September 18, 2019",
x = "", y = "") +
# ylab(expression(paste("Number of days in Tallahassee, FL at or above 100", {}^o, " F"))) +
theme_minimal() +
theme(axis.text.x = element_text(size = 11), legend.position = "none")
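The counting trick relies on R coercing a logical vector to 0s and 1s when summed. A minimal base R sketch with made-up temperatures:

```r
# sum() on a logical vector counts the TRUEs; na.rm = TRUE skips missing days
tmax <- c(98, 101, NA, 103, 99)
tmax >= 100                     # FALSE TRUE NA TRUE FALSE
sum(tmax >= 100, na.rm = TRUE)  # 2
```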
What does the histogram of daily high temperatures look like?
( gTLH <- ggplot(data = TLH.df,
mapping = aes(x = TmaxF)) +
geom_histogram(binwidth = 1, aes(fill = ..count..)) +
scale_fill_continuous(low = 'green', high = 'blue') +
scale_x_continuous() +
scale_y_continuous() +
ylab("Number of Days") +
xlab(expression(paste("Daily High Temperature in Tallahassee, FL (", {}^o, " F)"))) +
theme_minimal() +
theme(legend.position = "none") )## Warning: Removed 2 rows containing non-finite values (stat_bin).

Q: The most common high temperatures are in the low 90s, but there are relatively few 100+ days. Why?
Compare the histogram of daily high temperatures in Tallahassee with a histogram of daily high temperatures in Las Vegas, Nevada. Here we repeat the code above but for the data frame LVG.df. We then use the operators from the {patchwork} package to plot them side by side.
LVG.df <- readr::read_csv(file = "http://myweb.fsu.edu/jelsner/temp/data/LV_DailySummary.csv",
na = "-9999")## New names:
## • `Measurement Flag` -> `Measurement Flag...8`
## • `Quality Flag` -> `Quality Flag...9`
## • `Source Flag` -> `Source Flag...10`
## • `Time of Observation` -> `Time of Observation...11`
## • `Measurement Flag` -> `Measurement Flag...13`
## • `Quality Flag` -> `Quality Flag...14`
## • `Source Flag` -> `Source Flag...15`
## • `Time of Observation` -> `Time of Observation...16`
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 23872 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): STATION, STATION_NAME, ELEVATION, LATITUDE, LONGITUDE, Source Flag....
## dbl (5): DATE, TMAX, Time of Observation...11, TMIN, Time of Observation...16
## lgl (4): Measurement Flag...8, Quality Flag...9, Measurement Flag...13, Qual...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
LVG.df <- LVG.df |>
dplyr::mutate(TmaxF = round(9/5 * TMAX/10 + 32), # TMAX is in tenths of a degree C
TminF = round(9/5 * TMIN/10 + 32),
Date = as.Date(as.character(DATE), # DATE is an integer of the form YYYYMMDD
format = "%Y%m%d")) |>
dplyr::select(Date, TmaxF, TminF)
gLVG <- ggplot(data = LVG.df,
mapping = aes(x = TmaxF)) +
geom_histogram(binwidth = 1, aes(fill = ..count..)) +
scale_fill_continuous(low = 'green', high = 'blue') +
scale_x_continuous() +
scale_y_continuous() +
ylab("Number of Days") +
xlab(expression(paste("Daily High Temperature in Las Vegas, NV (", {}^o, " F)"))) +
theme_minimal() +
theme(legend.position = "none")
#install.packages("patchwork")
library(patchwork)
gTLH / gLVG## Warning: Removed 2 rows containing non-finite values (stat_bin).

US population and area by state
The object us_states from the {spData} package is a data frame from the U.S. Census Bureau. The variables include the state GEOID and NAME, the REGION (South, West, etc), AREA (in square km), and total population in 2010 (total_pop_10) and in 2015 (total_pop_15).
us_states <- spData::us_states
class(us_states)## [1] "sf" "data.frame"
head(us_states)## GEOID NAME REGION AREA total_pop_10 total_pop_15
## 1 01 Alabama South 133709.27 4712651 4830620
## 2 04 Arizona West 295281.25 6246816 6641928
## 3 08 Colorado West 269573.06 4887061 5278906
## 4 09 Connecticut Norteast 12976.59 3545837 3593222
## 5 12 Florida South 151052.01 18511620 19645772
## 6 13 Georgia South 152725.21 9468815 10006693
## geometry
## (long list of polygon boundary coordinates omitted for brevity)
The object us_states has two classes: simple feature and data frame. It is a data frame with spatial information stored in the column labeled geometry. More on this in the next lesson.
Note also that the variable AREA is numeric with units (km^2). Thus to perform some operations you need to specify units or convert the column to numeric with as.numeric(). For example, to filter by area keeping only states with an area greater than 300,000 square km you could do the following.
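An alternative that keeps the units is to compare the column against an explicit units object. This is a sketch assuming the {units} package (which {sf} uses to store AREA) is installed:

```r
# Compare the units column against an explicit units threshold
library(units)
threshold <- set_units(300000, km^2)
big <- spData::us_states[spData::us_states$AREA > threshold, ]
nrow(big)
```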
us_states |>
dplyr::mutate(Area = as.numeric(AREA)) |>
dplyr::filter(Area > 300000)
For now, suppose you want to plot area versus population for each state, including state names on the plot. Note the large differences between the minimum and maximum values of both variables.
us_states |>
dplyr::summarize(rA = range(AREA),
rP = range(total_pop_15))## rA rP
## 1 178.21 579679
## 2 687714.28 38421464
Let’s start with a simple scatter plot using logarithmic scales. The variable AREA has units so you convert it to a numeric with the as.numeric() function.
ggplot(data = us_states,
mapping = aes(x = as.numeric(AREA),
y = total_pop_15)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
Next you use the {scales} package so the tick labels are expressed as whole numbers with commas.
ggplot(data = us_states,
mapping = aes(x = as.numeric(AREA),
y = total_pop_15)) +
geom_point() +
scale_x_log10(labels = scales::comma) +
scale_y_log10(labels = scales::comma)
Next you add text labels. You can do this with geom_text() or geom_label()
ggplot(data = us_states,
mapping = aes(x = as.numeric(AREA),
y = total_pop_15)) +
geom_point() +
geom_text(aes(label = NAME)) +
scale_x_log10(labels = scales::comma) +
scale_y_log10(labels = scales::comma)
The labels are centered on top of the points. To fix this you use functions from the {ggrepel} package.
ggplot(data = us_states,
mapping = aes(x = as.numeric(AREA),
y = total_pop_15)) +
geom_point() +
ggrepel::geom_text_repel(aes(label = NAME)) +
scale_x_log10(labels = scales::comma) +
scale_y_log10(labels = scales::comma)## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Finally, since the data object is a simple feature data frame you can make a map.
ggplot() +
geom_sf(data = spData::us_states,
mapping = aes(fill = total_pop_15)) +
scale_fill_continuous(labels = scales::comma) +
theme_void()
More resources and additional examples
- ggplot extensions https://exts.ggplot2.tidyverse.org/
- Cheat sheets: https://rstudio.com/resources/cheatsheets/
- More examples using the {spData} package: https://geocompr.robinlovelace.net/
Thursday September 8, 2022
“An awful lot of time I spend ‘coding’ is actually spent copying and pasting (and much of the rest is spent googling).” – Meghan Duffy
Today
- Working with spatial data
- Geo-computation on simple features
Working with spatial data
The vector model for data (vector data) represents things in the world using points, lines and polygons. These objects have discrete, well-defined borders and a high level of precision. Of course, precision does not imply accuracy.
The raster model for data (raster data) represents continuous fields (like elevation and rainfall) using a grid of cells (raster). A raster aggregates fields to a given resolution, meaning that rasters are consistent over space and scalable. The smallest features within the field are blurred or lost.
The choice of which data model to use depends on the application: Vector data tends to dominate the social sciences because human settlements and boundaries have discrete borders. Raster data (e.g., remotely sensed imagery) tends to dominate the environmental sciences because environmental conditions are typically continuous. Geographers, ecologists, demographers use vector and raster data.
Here we use functions from the {sf} package to work with vector data and functions in the {terra} and {raster} packages to work with raster data sets. We will also look at functions from the new {stars} package that work with both vector and raster data models.
R’s spatial ecosystem continues to evolve. Most changes build on what has already been done. Occasionally there is a significant change that builds from scratch. The introduction of the {sf} package in 2016 (by Edzer Pebesma) is a significant change.
Simple features
Simple features is a standard from the Open Geospatial Consortium (OGC) to represent geographic information. It is a hierarchical data model that represents a wide range of geometric forms within a single geometry class.
The standard is used in spatial databases (e.g., PostGIS), commercial GIS (e.g., ESRI) and forms the vector data basis for libraries such as GDAL. A subset of simple features forms the GeoJSON standard. The {sf} package supports these classes and includes plotting and other methods.
Functions in the {sf} package work with all common vector geometry types: points, lines, polygons and their respective ‘multi’ versions (which group together features of the same type into a single feature). These functions also support geometry collections, which contain multiple geometry types in a single object. The raster data classes are not supported.
The {sf} package supersedes the three main packages of the legacy spatial R ecosystem: {sp} for the class system, {rgdal} for reading and writing data, and {rgeos} for spatial operations done with GEOS.
Simple features are data frames with a special column for storing the spatial information. The spatial column is called geometry (or geom). The geometry column is referenced like a regular column.
The difference is that the geometry column is a ‘list column’ of class sfc (simple feature column). An sfc is a set of objects of class sfg (simple feature geometries).
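You can verify this structure directly. A short sketch, assuming the {sf} package is installed:

```r
# The geometry column is a list-column: the column itself is an sfc,
# and each element of the list is an sfg
g <- sf::st_sfc(sf::st_point(c(5, 2)), sf::st_point(c(1, 3)))
class(g)       # "sfc_POINT" "sfc"
class(g[[1]])  # "XY" "POINT" "sfg"
```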
Simple Feature Anatomy
- The green box is a simple feature: a single record, or data.frame row, consisting of attributes and geometry
- The blue box is a single simple feature geometry (an object of class sfg)
- The red box is a simple feature list-column (an object of class sfc, which is a column in the data.frame)
- The geometries are given in well-known text (WKT) format
Geometries are the building blocks of simple features. Well-known text (WKT) is the way simple feature geometries are coded. Well-known binaries (WKB) are hexadecimal strings readable by computers. GIS and spatial databases use WKB to transfer and store geometry objects. WKT is a human-readable text description of simple features. The two formats are exchangeable.
See: https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry
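You can move between the two formats from R. A short sketch using {sf} helper functions:

```r
# Round-trip a geometry through its WKT and WKB encodings
pt <- sf::st_point(c(5, 2))
sf::st_as_text(pt)    # human-readable WKT: "POINT (5 2)"
sf::st_as_binary(pt)  # machine-readable WKB (a vector of raw bytes)
```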
In WKT format a point is a coordinate in 2D, 3D or 4D space (see vignette("sf1") for more information) such as:
POINT (5 2)
The first number is the x coordinate and the second number is the y coordinate.
A line string is a sequence of points with a straight line connecting the points, for example:
LINESTRING (1 5, 4 4, 4 1, 2 2, 3 2)
Each pair of x and y coordinates is separated by a comma.
A polygon is a sequence of points that form a closed, non-intersecting ring. Closed means that the first and the last point of a polygon have the same coordinates. A polygon has one exterior boundary (outer ring) but it can have interior boundaries (inner rings). An inner ring is called a ‘hole’.
- Polygon without a hole -
POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5))
Here there are two parentheses to start and two to end the string of coordinates.
- Polygon with one hole -
POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5), (2 4, 3 4, 3 3, 2 3, 2 4))
Here the first set of coordinates defines the outer edge of the polygon and the next set of coordinates defines the hole. The outer edge vertices are connected in a counterclockwise direction. The inner edge vertices (defining the hole in the polygon) are connected in a clockwise direction.
Simple features allow multiple geometries of a single type to be grouped into one feature using the ‘multi’ version of each geometry type:
- Multi-point - MULTIPOINT (5 2, 1 3, 3 4, 3 2)
- Multi-line string - MULTILINESTRING ((1 5, 4 4, 4 1, 2 2, 3 2), (1 2, 2 4))
- Multi-polygon - MULTIPOLYGON (((1 5, 2 2, 4 1, 4 4, 1 5)), ((0 2, 1 2, 1 3, 0 3, 0 2)))
The difference (syntax wise) between a polygon with a hole and a multi-polygon is the nesting of the parentheses: in a multi-polygon each polygon is wrapped in its own set of parentheses, whereas the rings of a single polygon (outer edge and holes) share one set.
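One way to see the difference is to parse both WKT strings and count the resulting polygons. A sketch assuming the {sf} package:

```r
# A polygon with a hole is ONE polygon (two rings);
# a multi-polygon groups SEPARATE polygons
holey <- sf::st_as_sfc("POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5), (2 4, 3 4, 3 3, 2 3, 2 4))")
multi <- sf::st_as_sfc("MULTIPOLYGON (((1 5, 2 2, 4 1, 4 4, 1 5)), ((0 2, 1 2, 1 3, 0 3, 0 2)))")
length(sf::st_cast(holey, "POLYGON"))  # 1
length(sf::st_cast(multi, "POLYGON"))  # 2
```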
A collection of different geometry types is made with:
- Geometry collection - GEOMETRYCOLLECTION (MULTIPOINT (5 2, 1 3, 3 4, 3 2), LINESTRING (1 5, 4 4, 4 1, 2 2, 3 2))
Simple feature geometry (sfg)
The sfg class represents the simple feature geometry types: point, line string, polygon (and their ‘multi’ equivalents, such as multi points) or geometry collection.
Usually you don’t need to create geometries. Geometries are typically part of the spatial data we import. However, there are a set of functions to create simple feature geometry objects (sfg) from scratch, if needed. The names of these functions are simple and consistent, as they all start with the st_ prefix and end with the name of the geometry type in lowercase letters:
- A point - st_point()
- A line string - st_linestring()
- A polygon - st_polygon()
- A multi-point - st_multipoint()
- A multi-line string - st_multilinestring()
- A multi-polygon - st_multipolygon()
- A geometry collection - st_geometrycollection()
An sfg object can be created from three data types:
- A numeric vector - a single point
- A matrix - a set of points, one point per row - a multi-point or line string
- A list - any other set, e.g. a multi-line string or geometry collection
To create point objects, you use the st_point() function from the {sf} package applied to a numeric vector.
sf::st_point(c(5, 2)) # XY point## POINT (5 2)
sf::st_point(c(5, 2, 3)) # XYZ point## POINT Z (5 2 3)
To create multi-point objects, you use matrices constructed from the rbind() function.
mp.matrix <- rbind(c(5, 2), c(1, 3), c(3, 4), c(3, 2))
mp.matrix## [,1] [,2]
## [1,] 5 2
## [2,] 1 3
## [3,] 3 4
## [4,] 3 2
sf::st_multipoint(mp.matrix)## MULTIPOINT ((5 2), (1 3), (3 4), (3 2))
ls.matrix <- rbind(c(1, 5), c(4, 4), c(4, 1), c(2, 2), c(3, 2))
sf::st_linestring(ls.matrix)## LINESTRING (1 5, 4 4, 4 1, 2 2, 3 2)
plot(sf::st_multipoint(mp.matrix))
plot(sf::st_linestring(ls.matrix))
To create a polygon, you use lists.
poly.list <- list(rbind(c(1, 5), c(2, 2), c(4, 1), c(4, 4), c(1, 5)))
sf::st_polygon(poly.list)## POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5))
poly.border <- rbind(c(1, 5), c(2, 2), c(4, 1), c(4, 4), c(1, 5))
poly.hole <- rbind(c(2, 4), c(3, 4), c(3, 3), c(2, 3), c(2, 4))
poly.with.hole.list <- list(poly.border, poly.hole)
sf::st_polygon(poly.with.hole.list)## POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5), (2 4, 3 4, 3 3, 2 3, 2 4))
plot(sf::st_polygon(poly.list))
plot(sf::st_polygon(poly.with.hole.list))
Simple feature geometry column
One sfg object contains a single simple feature geometry. A simple feature geometry column (sfc) is a list of sfg objects together with information about the coordinate reference system.
For example, to combine two simple features into one object with two features, you use the st_sfc() function. This is important because an sfc object represents the geometry column in sf data frames.
point1 <- sf::st_point(c(5, 2))
point2 <- sf::st_point(c(1, 3))
sf::st_sfc(point1, point2)## Geometry set for 2 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1 ymin: 2 xmax: 5 ymax: 3
## CRS: NA
## POINT (5 2)
## POINT (1 3)
In most cases, a sfc object contains objects of the same geometry type. Thus, when you convert sfg objects of type polygon into a simple feature geometry column, you end up with an sfc object of type polygon. A geometry column of multiple line strings would result in an sfc object of type multilinestring.
An example with polygons.
poly.list1 <- list(rbind(c(1, 5), c(2, 2), c(4, 1), c(4, 4), c(1, 5)))
polygon1 <- sf::st_polygon(poly.list1)
poly.list2 <- list(rbind(c(0, 2), c(1, 2), c(1, 3), c(0, 3), c(0, 2)))
polygon2 <- sf::st_polygon(poly.list2)
sf::st_sfc(polygon1, polygon2)## Geometry set for 2 features
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 1 xmax: 4 ymax: 5
## CRS: NA
## POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5))
## POLYGON ((0 2, 1 2, 1 3, 0 3, 0 2))
plot(sf::st_sfc(polygon1, polygon2))
An example with line strings.
mls.list1 <- list(rbind(c(1, 5), c(4, 4), c(4, 1), c(2, 2), c(3, 2)),
rbind(c(1, 2), c(2, 4)))
mls1 <- sf::st_multilinestring(mls.list1)
mls.list2 <- list(rbind(c(2, 9), c(7, 9), c(5, 6), c(4, 7), c(2, 7)),
rbind(c(1, 7), c(3, 8)))
mls2 <- sf::st_multilinestring(mls.list2)
sf::st_sfc(mls1, mls2)## Geometry set for 2 features
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: 1 ymin: 1 xmax: 7 ymax: 9
## CRS: NA
## MULTILINESTRING ((1 5, 4 4, 4 1, 2 2, 3 2), (1 ...
## MULTILINESTRING ((2 9, 7 9, 5 6, 4 7, 2 7), (1 ...
plot(sf::st_sfc(mls1, mls2))
An example with a geometry collection.
sf::st_sfc(point1, mls1)## Geometry set for 2 features
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: 1 ymin: 1 xmax: 5 ymax: 5
## CRS: NA
## POINT (5 2)
## MULTILINESTRING ((1 5, 4 4, 4 1, 2 2, 3 2), (1 ...
An sfc object also stores information on the coordinate reference system (CRS). To specify a certain CRS, you use the epsg or proj4string attributes. The default value of epsg and proj4string is NA (Not Available).
sf::st_sfc(point1, point2)## Geometry set for 2 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1 ymin: 2 xmax: 5 ymax: 3
## CRS: NA
## POINT (5 2)
## POINT (1 3)
All geometries in an sfc object must have the same CRS. You add the coordinate reference system with the crs = argument in st_sfc(). The argument accepts an integer with the epsg code (for example, 4326).
( sfc1 <- sf::st_sfc(point1, point2,
crs = 4326) )## Geometry set for 2 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 1 ymin: 2 xmax: 5 ymax: 3
## Geodetic CRS: WGS 84
## POINT (5 2)
## POINT (1 3)
The epsg code is translated to a well-known text (WKT) representation of the CRS.
sf::st_crs(sfc1)## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
Here the WKT describes a two-dimensional geographic coordinate reference system (GEOGCRS) with a latitude axis first, then a longitude axis. The coordinate system is related to Earth by the WGS84 geodetic datum.
See: https://en.wikipedia.org/wiki/Well-known_text_representation_of_coordinate_reference_systems
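The object returned by st_crs() can be queried for individual components. A short sketch assuming the {sf} package (the exact proj4string output may vary with your PROJ version):

```r
# Pull individual pieces out of a CRS object
crs <- sf::st_crs(4326)
crs$epsg         # 4326
crs$proj4string  # e.g. "+proj=longlat +datum=WGS84 +no_defs"
```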
Simple feature data frames
Features (geometries) typically come with attributes. The attributes might represent the name of the geometry, measured values, groups to which the geometry belongs, etc.
The simple feature class, sf, is a combination of an attribute table (data.frame) and a simple feature geometry column (sfc). Simple features are created using the st_sf() function.
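A minimal sketch of this combination, with made-up attribute names:

```r
# Combine an attribute data frame with an sfc to get an sf object
g <- sf::st_sfc(sf::st_point(c(0, 0)), sf::st_point(c(1, 1)))
df <- data.frame(id = 1:2)
sfobj <- sf::st_sf(df, geometry = g)
class(sfobj)  # "sf" "data.frame"
```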
Objects of class sf behave like regular data frames.
methods(class = "sf")## [1] [ [[<- $<- aggregate as.data.frame
## [6] cbind coerce filter identify initialize
## [11] merge plot print rbind show
## [16] slotsFromS3 transform
## see '?methods' for accessing help and source code
Simple features have two classes, sf and data.frame. This is central to the concept of simple features: most of the time a sf can be treated as, and behaves like, a data.frame. Simple features are, in essence, data frames but with a column containing the geometric information.
I refer to simple feature objects redundantly as ‘simple feature data frames’ to distinguish them from S4 class spatial data frames.
Many of these functions were developed for data frames including rbind() (for binding rows of data together) and $ (for creating new columns). The key feature of sf objects is that they store spatial and non-spatial data in the same way, as columns in a data.frame.
The geometry column of {sf} objects is typically called geometry but any name can be used.
Thus sf objects take advantage of R’s data analysis capabilities to be used on geographic data. It’s worth reviewing how to discover basic properties of vector data objects.
For example, we get information about the size and breadth of the world simple feature data frame from the {spData} package using dim(), nrow(), etc.
library(spData)
dim(world)## [1] 177 11
nrow(world)## [1] 177
ncol(world)## [1] 11
The data contain ten non-geographic columns (and one geometry column) and 177 rows, each representing a country.
Extracting the attribute data from an sf object is the same as dropping its geometry.
world.df <- world |>
sf::st_drop_geometry()
class(world.df)## [1] "tbl_df" "tbl" "data.frame"
Example: Temperatures at FSU and at the airport
Suppose you measure a temperature of 10C at FSU and 5C at the airport at 9 a.m. on January 27, 2021. Thus, you have specific points in space (the coordinates), the names of the locations (FSU, Airport), temperature values and the date of the measurement. Other attributes might include an urbanity category (campus or city), or a remark if the measurement was made with an automatic station.
Start by creating two sfg (simple feature geometry) point objects.
FSU.point <- sf::st_point(c(-84.29849, 30.44188))
TLH.point <- sf::st_point(c(-84.34505, 30.39541))
Then combine the point objects into an sfc (simple feature column) object.
our.geometry <- sf::st_sfc(FSU.point, TLH.point,
                           crs = 4326)
Then create a data frame of attributes.
our.attributes <- data.frame(name = c("FSU", "Airport"),
temperature = c(10, 5),
date = c(as.Date("2021-01-27"), as.Date("2021-01-27")),
category = c("campus", "airport"),
                             automatic = c(TRUE, FALSE))
Finally create a simple feature data frame.
sfdf <- sf::st_sf(our.attributes,
                  geometry = our.geometry)
The example illustrates the components of sf objects. First, you use coordinates to define simple feature geometries (sfg). Second, you combine the geometries into a simple feature geometry column (sfc), which also stores the CRS. Third, you store the attribute information on the geometries in a data.frame. Fourth, you use the st_sf() function to combine the attribute table and the sfc object into an sf object.
sfdf
## Simple feature collection with 2 features and 5 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -84.34505 ymin: 30.39541 xmax: -84.29849 ymax: 30.44188
## Geodetic CRS: WGS 84
## name temperature date category automatic geometry
## 1 FSU 10 2021-01-27 campus TRUE POINT (-84.29849 30.44188)
## 2 Airport 5 2021-01-27 airport FALSE POINT (-84.34505 30.39541)
class(sfdf)
## [1] "sf"         "data.frame"
Given a simple feature data frame, you can create a non-spatial data frame (one that keeps the geometry list-column but is no longer of class sf) with the as.data.frame() function.
df <- as.data.frame(sfdf)
class(df)
## [1] "data.frame"
In this case the geometry column
- is no longer an sfc,
- no longer has a plot method, and
- lacks all the dedicated methods listed above for class sf.
To avoid any confusion it might be better to use the st_drop_geometry() function instead.
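To see the difference, here is a small sketch; the two-point object pts is hypothetical, built from scratch so the chunk stands alone.

```r
library(sf)

# A tiny hypothetical simple feature data frame with two points
pts <- st_sf(name = c("A", "B"),
             geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)),
                               crs = 4326))

df1 <- as.data.frame(pts)       # keeps the geometry list-column, drops class sf
df2 <- st_drop_geometry(pts)    # removes the geometry column entirely

class(df1)    # "data.frame"
names(df1)    # still contains "geometry"
names(df2)    # only "name"
```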
Example: US states
The object us_states from the {spData} package is a simple feature data frame from the U.S. Census Bureau. The variables include the name, region, area, and population.
Simple feature data frames can be treated as regular data frames, but the geometry is "sticky". For example, when we create a new data frame containing only the population information, the geometry column is included in the new data frame.
df1 <- us_states |>
dplyr::select(starts_with("total"))
head(df1)
## Simple feature collection with 6 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -114.8136 ymin: 24.55868 xmax: -71.78699 ymax: 42.04964
## Geodetic CRS: NAD83
## total_pop_10 total_pop_15 geometry
## 1 4712651 4830620 MULTIPOLYGON (((-88.20006 3...
## 2 6246816 6641928 MULTIPOLYGON (((-114.7196 3...
## 3 4887061 5278906 MULTIPOLYGON (((-109.0501 4...
## 4 3545837 3593222 MULTIPOLYGON (((-73.48731 4...
## 5 18511620 19645772 MULTIPOLYGON (((-81.81169 2...
## 6 9468815 10006693 MULTIPOLYGON (((-85.60516 3...
The resulting data frame has the two population columns but also a column labeled geometry.
When we use the summarize() function, a union of the geometries across rows is made.
df2 <- us_states |>
dplyr::filter(REGION == "Midwest") |>
dplyr::summarize(TotalPop2010 = sum(total_pop_10),
TotalPop2015 = sum(total_pop_15))
head(df2)
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -104.0577 ymin: 35.99568 xmax: -80.51869 ymax: 49.38436
## Geodetic CRS: NAD83
## TotalPop2010 TotalPop2015 geometry
## 1 66514091 67546398 MULTIPOLYGON (((-85.48703 4...
Why use the {sf} package when {sp} is already tried and tested?
- Fast reading and writing of data
- Enhanced plotting performance
- sf objects are treated as data frames in most operations
- sf functions can be chained with the |> operator and work well with the {tidyverse} packages
- sf function names are consistent and intuitive (all begin with st_)
These advantages led to the development of spatial packages (including {tmap}, {mapview} and {tidycensus}) that now support simple feature objects.
It is easy to convert between the two classes. Consider the world simple feature (S3) data frame from the {spData} package. You convert it to an S4 spatial data frame with the as() method.
world.sp <- world |>
  as(Class = "Spatial")
The method coerces simple feature objects to Spatial* and Spatial*DataFrame objects.
You convert a S4 spatial data frame into a simple feature data frame with the st_as_sf() function.
world.sf <- world.sp |>
  sf::st_as_sf()
You can create basic maps from simple feature data frames with the base plot() method (plot.sf()). The function creates a multi-panel plot with one sub-plot for each variable.
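As a sketch of this plot() behavior (assuming {spData} is installed):

```r
library(spData)

plot(world)                   # multi-panel: one sub-plot per attribute (up to a limit)
plot(world["lifeExp"])        # a single variable, with a legend
plot(sf::st_geometry(world))  # geometries only, no attribute shading
```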
Geo-computation on simple features
Geo-computation on simple features is done with routines from the Geometry Engine, Open Source (GEOS) library, which functions in the {sf} package make use of.
As an example, consider the file police.zip on my website that contains shapefiles in a folder called police. The variables include police expenditures (POLICE), crime (CRIME), income (INC), unemployment (UNEMP) and other socio-economic variables for counties in Mississippi.
Input the data using the st_read() function from the {sf} package and then assign a geographic coordinate reference system (CRS) to it using the EPSG number 4326.
download.file(url = "http://myweb.fsu.edu/jelsner/temp/data/police.zip",
destfile = here::here("data", "police.zip"))
unzip(here::here("data", "police.zip"),
exdir = here::here("data"))
sfdf <- sf::st_read(dsn = here::here("data", "police"),
                    layer = "police")
## Reading layer `police' from data source
##   `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/police' using driver `ESRI Shapefile'
## Simple feature collection with 82 features and 21 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -91.64356 ymin: 30.19474 xmax: -88.09043 ymax: 35.00496
## CRS: NA
sf::st_crs(sfdf) <- 4326
The geometries are polygons and there are 82 of them, one for each county.
You transform the geographic coordinate system of the polygons to a specific projected CRS as suggested by the function suggest_crs() from the {crsuggest} package.
crsuggest::suggest_crs(sfdf)
## # A tibble: 10 × 6
## crs_code crs_name crs_type crs_gcs crs_units crs_proj4
## <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 6508 NAD83(2011) / Mississippi TM project… 6318 m +proj=tm…
## 2 3816 NAD83(NSRS2007) / Mississippi … project… 4759 m +proj=tm…
## 3 3815 NAD83(HARN) / Mississippi TM project… 4152 m +proj=tm…
## 4 3814 NAD83 / Mississippi TM project… 4269 m +proj=tm…
## 5 6510 NAD83(2011) / Mississippi West… project… 6318 us-ft +proj=tm…
## 6 6509 NAD83(2011) / Mississippi West project… 6318 m +proj=tm…
## 7 3600 NAD83(NSRS2007) / Mississippi … project… 4759 us-ft +proj=tm…
## 8 3599 NAD83(NSRS2007) / Mississippi … project… 4759 m +proj=tm…
## 9 2900 NAD83(HARN) / Mississippi West… project… 4152 us-ft +proj=tm…
## 10 2814 NAD83(HARN) / Mississippi West project… 4152 m +proj=tm…
The function for transforming the CRS is st_transform() from the {sf} package.
sfdf <- sfdf |>
  sf::st_transform(crs = 6508)
The st_centroid() function computes the geographic center of each polygon in the spatial data frame.
countyCenters.sf <- sfdf |>
  sf::st_centroid()
## Warning in st_centroid.sf(sfdf): st_centroid assumes attributes are constant
## over geometries of x
The warning lets you know that the attributes attached to each polygon might result in misleading information when attached to the new geometry (points). Different geometries can mean different interpretations of the attribute.
sf::st_geometry(countyCenters.sf)
## Geometry set for 82 features
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 347970.3 ymin: 1069814 xmax: 639099.2 ymax: 1565402
## Projected CRS: NAD83(2011) / Mississippi TM
## First 5 geometries:
## POINT (607909.2 1565402)
## POINT (639099.2 1550199)
## POINT (577860 1552732)
## POINT (552191 1557790)
## POINT (478087.6 1564075)
To get the centroid location for the state, you first join all the counties using the st_union() function, then use the st_centroid() function.
stateCenter.sfc <- sfdf |>
sf::st_union() |>
  sf::st_centroid()
The result is a simple feature geometry column (sfc) with a single row where the geometry contains the centroid location.
Which county contains the geographic center of the state? Here you use the geometric binary predicate st_contains().
( Contains <- sfdf |>
sf::st_contains(stateCenter.sfc,
                      sparse = FALSE) )
##       [,1]
##  [1,] FALSE
##  [2,] FALSE
##  [3,] FALSE
## ...
## [44,]  TRUE
## ...
## [82,] FALSE
You include the sparse = FALSE argument so the result is a matrix containing TRUEs and FALSEs. Since there are 82 counties and one centroid the matrix has 82 rows and 1 column. All matrix entries are FALSE except the one containing the center.
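The sparse/dense distinction can be sketched with two toy squares and one point (hypothetical geometries, not the Mississippi data):

```r
library(sf)

# Two unit squares: the point falls inside the first one only
squares <- st_sfc(st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1),
                                        c(0, 1), c(0, 0)))),
                  st_polygon(list(rbind(c(2, 0), c(3, 0), c(3, 1),
                                        c(2, 1), c(2, 0)))))
pt <- st_sfc(st_point(c(.5, .5)))

# Dense: a 2 x 1 logical matrix, one row per square
st_contains(squares, pt, sparse = FALSE)

# Sparse (the default): a list giving, for each square, the indices
# of the points it contains (empty for the second square)
st_contains(squares, pt)
```

With the dense form, which() quickly recovers the position of the single TRUE, e.g. which(Contains) for the county example above.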
To map the result you first plot the county polygons, then add the county geometry for the center county and fill it red. Note that you use the matrix you called Contains to subset this county. Finally you add the location of the state centroid to the plot.
library(ggplot2)
ggplot(data = sfdf) +
geom_sf() +
geom_sf(data = sfdf[Contains, ], col = "red") +
geom_sf(data = stateCenter.sfc) +
theme_void()
The function st_area() returns a vector with the area (in square units of the CRS) of each spatial object, here the county polygons.
sfdf |>
  sf::st_area()
## Units: [m^2]
## [1] 1059845789 1129805491 1177795196 1063555913 1292308911 1837313947
## [7] 1209563206 1050434214 1081610367 1111684631 1789727469 1873807962
## [13] 1537935343 1182321227 1047492468 1387412534 1298534836 1305083703
## [19] 1689710519 1539012925 2325208833 2002801262 1320449739 1821075582
## [25] 1165056081 1566560661 1062455271 1313215090 1103428267 1584296774
## [31] 1053624790 1195801329 1966286085 1077116415 2004569381 1127253959
## [37] 1933657382 1811075898 1604263674 1162255863 2412017942 1136584592
## [43] 1491632367 1497746748 1952707326 1894657055 1566911457 1653697972
## [49] 2045637420 1824649519 1481229225 2294433604 1797593028 1798362792
## [55] 1625787733 1254286842 1564126825 2051293154 2027900509 1366995434
## [61] 1858641378 1066358984 1072169833 1117271183 1208280395 1493574538
## [67] 1450255735 1398323080 1284051541 1225628132 1695604706 1798766462
## [73] 1811701938 1955125869 1067233035 1063132758 2151445783 1205245026
## [79] 1177906764 1834858322 1533395797 1238226953
The vector values have units of square meters (m^2), which are derived from the CRS.
There is an attribute called AREA in the data frame, but it is better to calculate the area from the spatial polygons because then you are sure of the units.
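Because the result carries units, you can convert it explicitly rather than dividing by a magic number. A sketch with a hypothetical 1 km by 1 km square in a metric projected CRS (UTM zone 16N):

```r
library(sf)
library(units)

# Hypothetical 1000 m x 1000 m square; coordinates in meters
p <- st_sfc(st_polygon(list(rbind(c(0, 0), c(1000, 0), c(1000, 1000),
                                  c(0, 1000), c(0, 0)))),
            crs = 32616)   # UTM zone 16N

a <- st_area(p)            # area in m^2
set_units(a, km^2)         # the same area expressed in km^2
```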
What happens when you apply the area function on the centroid object?
countyCenters.sf |>
  sf::st_area()
## Units: [m^2]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [77] 0 0 0 0 0 0
Compute a 10 km buffer around the state and show the result with a plot. First use st_union(), then st_buffer(), then pipe the output to ggplot().
sfdf |>
sf::st_union() |>
sf::st_buffer(dist = 10000) |>
ggplot() +
geom_sf() +
geom_sf(data = sfdf) +
theme_void()
Length of boundary lines for U.S. states. Transform the CRS to 2163 (US National Atlas Equal Area). Note that the geometry is multi-polygons. Convert the polygons to multi-linestrings, then use st_length() to get the total length of the lines.
states <- spData::us_states |>
sf::st_transform(crs = 2163)
sf::st_length(states) # returns zeroes because geometry is polygon
## Units: [m]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0
states |>
sf::st_cast(to = "MULTILINESTRING") |>
  sf::st_length()
## Units: [m]
## [1] 1710433.26 2298001.84 2102155.22 514571.45 2949041.87 1746686.74
## [7] 2570625.84 1436260.52 1957704.71 2216035.60 1018647.33 2567370.61
## [13] 2112070.19 2826608.77 2338551.23 747093.89 2296339.40 1771057.91
## [19] 2334844.08 1497781.59 1275536.83 1998187.83 4959188.68 778627.31
## [25] 1592462.66 1680675.61 3810365.04 407944.22 60329.77 1859164.04
## [31] 1656525.21 1835778.08 1572793.19 1638356.40 3579150.33 1711366.62
## [37] 2051115.13 782857.63 2378176.63 2221531.40 1467821.30 2198130.04
## [43] 304698.28 1855108.54 1972765.77 2407304.31 2344371.69 1900586.84
## [49] 2029017.32
Tuesday September 13, 2022
“Hell isn’t other people’s code. Hell is your own code from 3 years ago.” – Jeff Atwood
Today
- Spatial data subsets and joins
- Interpolating variables using areal weights
Spatial data subsets and joins
Variables (stored as columns) in spatial data structures are referred to as ‘attributes’.
With simple feature data frames you can create data subsets using [, subset() and $ from the {base} R packages and select() and filter() from the {dplyr} package.
The [ operator subsets rows and columns. Indexes specify the elements you wish to extract from an object, e.g. object[i, j], with i and j typically being numbers representing rows and columns. Leaving i or j empty returns all rows or columns, so world[1:5, ] returns the first five rows and all columns of the simple feature data frame world (from the {spData} package). Some examples:
world <- spData::world
world[c(1, 5, 9), ] # subset rows by row position
## Simple feature collection with 3 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -18.28799 xmax: 180 ymax: 71.35776
## Geodetic CRS: WGS 84
## # A tibble: 3 × 11
## iso_a2 name_long continent region_un subregion type area_km2 pop lifeExp
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 FJ Fiji Oceania Oceania Melanesia Sove… 19290. 8.86e5 70.0
## 2 US United Sta… North Am… Americas Northern… Coun… 9510744. 3.19e8 78.8
## 3 ID Indonesia Asia Asia South-Ea… Sove… 1819251. 2.55e8 68.9
## # … with 2 more variables: gdpPercap <dbl>, geom <MULTIPOLYGON [°]>
world[, 1:3] # subset columns by column position
## Simple feature collection with 177 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 4
## iso_a2 name_long continent geom
## <chr> <chr> <chr> <MULTIPOLYGON [°]>
## 1 FJ Fiji Oceania (((-180 -16.55522, -179.9174 -16.50178…
## 2 TZ Tanzania Africa (((33.90371 -0.95, 31.86617 -1.02736, …
## 3 EH Western Sahara Africa (((-8.66559 27.65643, -8.817828 27.656…
## 4 CA Canada North America (((-132.71 54.04001, -133.18 54.16998,…
## 5 US United States North America (((-171.7317 63.78252, -171.7911 63.40…
## 6 KZ Kazakhstan Asia (((87.35997 49.21498, 86.82936 49.8266…
## 7 UZ Uzbekistan Asia (((55.96819 41.30864, 57.09639 41.3223…
## 8 PG Papua New Guinea Oceania (((141.0002 -2.600151, 141.0171 -5.859…
## 9 ID Indonesia Asia (((104.37 -1.084843, 104.0108 -1.05921…
## 10 AR Argentina South America (((-68.63401 -52.63637, -68.63335 -54.…
## # … with 167 more rows
world[, c("name_long", "lifeExp")] # subset columns by name
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 3
## name_long lifeExp geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Fiji 70.0 (((-180 -16.55522, -179.9174 -16.50178, -179.7933 -…
## 2 Tanzania 64.2 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -1.0…
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 27.65643, -8.794884…
## 4 Canada 82.0 (((-132.71 54.04001, -133.18 54.16998, -133.2397 53…
## 5 United States 78.8 (((-171.7317 63.78252, -171.7911 63.40585, -171.553…
## 6 Kazakhstan 71.6 (((87.35997 49.21498, 86.82936 49.82667, 85.54127 4…
## 7 Uzbekistan 71.0 (((55.96819 41.30864, 57.09639 41.32231, 56.93222 4…
## 8 Papua New Guinea 65.2 (((141.0002 -2.600151, 141.0171 -5.859022, 141.0339…
## 9 Indonesia 68.9 (((104.37 -1.084843, 104.0108 -1.059212, 103.4376 -…
## 10 Argentina 76.3 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.562…
## # … with 167 more rows
Here you use logical vectors to create a subset. First create a logical vector sel_area.
sel_area <- world$area_km2 < 10000
head(sel_area)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
summary(sel_area)
##    Mode   FALSE    TRUE
## logical 170 7
Then select only cases from the world simple feature data frame where the elements of the sel_area vector are TRUE.
small_countries <- world[sel_area, ]
This creates a new simple feature data frame, small_countries, containing nations whose surface area is smaller than 10,000 square kilometers.
Note: there is no harm in keeping the geometry column because an operation on a {sf} object only changes the geometry when appropriate (e.g. by dissolving borders between adjacent polygons following aggregation). This means that the speed of operations with attribute data in {sf} objects is the same as with columns in a data frames.
The {base} R function subset() provides another way to get the same result.
small_countries <- subset(world,
                          area_km2 < 10000)
The {dplyr} verbs work on {sf} spatial data frames. The functions include dplyr::select() and dplyr::filter().
CAUTION! The {dplyr} and {raster} packages both have a select() function. When both packages are attached in the same session, the function from the most recently attached package is used, 'masking' the other. This can generate error messages containing text like: unable to find an inherited method for function 'select' for signature "sf". To avoid this error, and to prevent ambiguity, always use the long-form function name, prefixed by the package name and two colons: dplyr::select().
The dplyr::select() function picks the columns by name or position. For example, you can select only two columns, name_long and pop, with the following command.
world1 <- world |>
dplyr::select(name_long, pop)
names(world1)
## [1] "name_long" "pop"       "geom"
The result is a simple feature data frame with the geometry column.
With the select() function you can subset and rename columns at the same time. Here you select the columns with names name_long and pop and give the pop column a new name (population).
world |>
dplyr::select(name_long,
                population = pop)
## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 3
## name_long population geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Fiji 885806 (((-180 -16.55522, -179.9174 -16.50178, -179.793…
## 2 Tanzania 52234869 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -…
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 27.65643, -8.794…
## 4 Canada 35535348 (((-132.71 54.04001, -133.18 54.16998, -133.2397…
## 5 United States 318622525 (((-171.7317 63.78252, -171.7911 63.40585, -171.…
## 6 Kazakhstan 17288285 (((87.35997 49.21498, 86.82936 49.82667, 85.5412…
## 7 Uzbekistan 30757700 (((55.96819 41.30864, 57.09639 41.32231, 56.9322…
## 8 Papua New Guinea 7755785 (((141.0002 -2.600151, 141.0171 -5.859022, 141.0…
## 9 Indonesia 255131116 (((104.37 -1.084843, 104.0108 -1.059212, 103.437…
## 10 Argentina 42981515 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.…
## # … with 167 more rows
The dplyr::pull() function returns a single vector without the geometry.
world |>
  dplyr::pull(pop)
##   [1]    885806  52234869        NA  35535348 318622525  17288285
##   [7]  30757700   7755785 255131116  42981515  17613798  73722860
## ...
## [175]   1821800   1354493  11530971
The filter() function keeps only rows matching given criteria, e.g., only countries with a very high average life expectancy.
world |>
sf::st_drop_geometry() |>
  dplyr::filter(lifeExp > 82)
## # A tibble: 9 × 10
## iso_a2 name_long continent region_un subregion type area_km2 pop lifeExp
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 IL Israel Asia Asia Western … Coun… 22991. 8.22e6 82.2
## 2 SE Sweden Europe Europe Northern… Sove… 450582. 9.70e6 82.3
## 3 CH Switzerland Europe Europe Western … Sove… 46185. 8.19e6 83.2
## 4 LU Luxembourg Europe Europe Western … Sove… 2417. 5.56e5 82.2
## 5 ES Spain Europe Europe Southern… Sove… 502306. 4.65e7 83.2
## 6 AU Australia Oceania Oceania Australi… Coun… 7687614. 2.35e7 82.3
## 7 IT Italy Europe Europe Southern… Sove… 315105. 6.08e7 83.1
## 8 IS Iceland Europe Europe Northern… Sove… 107736. 3.27e5 82.9
## 9 JP Japan Asia Asia Eastern … Sove… 404620. 1.27e8 83.6
## # … with 1 more variable: gdpPercap <dbl>
Aggregation summarizes a data frame by a grouping variable. An example of aggregation is to calculate the number of people per continent based on country-level data (one row per country).
This is done with the dplyr::group_by() and dplyr::summarize() functions.
world |>
dplyr::group_by(continent) |>
dplyr::summarize(Population = sum(pop, na.rm = TRUE),
                   nCountries = dplyr::n())
## Simple feature collection with 8 features and 3 fields
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 8 × 4
## continent Population nCountries geom
## <chr> <dbl> <int> <GEOMETRY [°]>
## 1 Africa 1154946633 51 MULTIPOLYGON (((43.1453 11.4620…
## 2 Antarctica 0 1 MULTIPOLYGON (((-180 -89.9, 180…
## 3 Asia 4311408059 47 MULTIPOLYGON (((104.37 -1.08484…
## 4 Europe 669036256 39 MULTIPOLYGON (((-180 64.97971, …
## 5 North America 565028684 18 MULTIPOLYGON (((-132.71 54.0400…
## 6 Oceania 37757833 7 MULTIPOLYGON (((-180 -16.55522,…
## 7 Seven seas (open ocean) 0 1 POLYGON ((68.935 -48.625, 68.86…
## 8 South America 412060811 13 MULTIPOLYGON (((-66.95992 -54.8…
The two columns in the resulting table are Population and nCountries. The functions sum() and dplyr::n() were the aggregating functions.
The result is a simple feature data frame with one row per continent; each continent's geometry is the geometric union of its countries' geometries, stored as a single (multi-)polygon.
You can chain together functions to find the world’s three most populous continents and the number of countries they contain.
world |>
dplyr::select(pop, continent) |>
dplyr::group_by(continent) |>
dplyr::summarize(Population = sum(pop, na.rm = TRUE),
nCountries = dplyr::n()) |>
  dplyr::top_n(n = 3, wt = Population)
## Simple feature collection with 3 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -34.81917 xmax: 180 ymax: 81.2504
## Geodetic CRS: WGS 84
## # A tibble: 3 × 4
## continent Population nCountries geom
## * <chr> <dbl> <int> <MULTIPOLYGON [°]>
## 1 Africa 1154946633 51 (((43.1453 11.46204, 42.71587 11.73564, 43.28…
## 2 Asia 4311408059 47 (((104.37 -1.084843, 104.0108 -1.059212, 103.…
## 3 Europe 669036256 39 (((-180 64.97971, -179.4327 65.40411, -179.88…
If you want to create a new column based on existing columns use dplyr::mutate(). For example, to calculate the population density for each country, divide the population column, here pop, by an area column, here area_km2, which has units of square kilometers.
world |>
  dplyr::mutate(Population_Density = pop / area_km2)
## Simple feature collection with 177 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 12
## iso_a2 name_long continent region_un subregion type area_km2 pop lifeExp
## * <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 FJ Fiji Oceania Oceania Melanesia Sove… 1.93e4 8.86e5 70.0
## 2 TZ Tanzania Africa Africa Eastern … Sove… 9.33e5 5.22e7 64.2
## 3 EH Western … Africa Africa Northern… Inde… 9.63e4 NA NA
## 4 CA Canada North Am… Americas Northern… Sove… 1.00e7 3.55e7 82.0
## 5 US United S… North Am… Americas Northern… Coun… 9.51e6 3.19e8 78.8
## 6 KZ Kazakhst… Asia Asia Central … Sove… 2.73e6 1.73e7 71.6
## 7 UZ Uzbekist… Asia Asia Central … Sove… 4.61e5 3.08e7 71.0
## 8 PG Papua Ne… Oceania Oceania Melanesia Sove… 4.65e5 7.76e6 65.2
## 9 ID Indonesia Asia Asia South-Ea… Sove… 1.82e6 2.55e8 68.9
## 10 AR Argentina South Am… Americas South Am… Sove… 2.78e6 4.30e7 76.3
## # … with 167 more rows, and 3 more variables: gdpPercap <dbl>,
## # geom <MULTIPOLYGON [°]>, Population_Density <dbl>
world |>
  dplyr::transmute(Population_Density = pop / area_km2)
## Simple feature collection with 177 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 2
## Population_Density geom
## * <dbl> <MULTIPOLYGON [°]>
## 1 45.9 (((-180 -16.55522, -179.9174 -16.50178, -179.7933 -16.020…
## 2 56.0 (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -1.01455, …
## 3 NA (((-8.66559 27.65643, -8.817828 27.65643, -8.794884 27.12…
## 4 3.54 (((-132.71 54.04001, -133.18 54.16998, -133.2397 53.85108…
## 5 33.5 (((-171.7317 63.78252, -171.7911 63.40585, -171.5531 63.3…
## 6 6.33 (((87.35997 49.21498, 86.82936 49.82667, 85.54127 49.6928…
## 7 66.7 (((55.96819 41.30864, 57.09639 41.32231, 56.93222 41.8260…
## 8 16.7 (((141.0002 -2.600151, 141.0171 -5.859022, 141.0339 -9.11…
## 9 140. (((104.37 -1.084843, 104.0108 -1.059212, 103.4376 -0.7119…
## 10 15.4 (((-68.63401 -52.63637, -68.63335 -54.8695, -67.56244 -54…
## # … with 167 more rows
The dplyr::transmute() function performs the same computation but also removes the other columns (except the geometry column).
Subsetting (filtering) your data based on geographic boundaries
The {USAboundaries} package has historical and contemporary boundaries for the United States provided by the U.S. Census Bureau.
Individual states are extracted using the us_states() function. CAUTION: this function has the same name as the object us_states from the {spData} package.
Here you use the argument states = to get only the state of Kansas. You then make a plot of the boundary and check the native coordinate reference system (CRS).
KS.sf <- USAboundaries::us_states(states = "Kansas")
library(ggplot2)
ggplot(data = KS.sf) +
geom_sf()
sf::st_crs(KS.sf)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
The polygon geometry includes the border and the area inside the border. The CRS is described by the 4326 EPSG code and implemented using well-known text (WKT).
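Individual pieces of the CRS object can be extracted with $. A short sketch, built from the EPSG code alone so it runs without the {USAboundaries} data:

```r
library(sf)

crs <- st_crs(4326)   # the same CRS as KS.sf

crs$epsg          # the EPSG code, 4326
crs$proj4string   # the legacy PROJ string representation
```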
You use a geometric operation to subset spatial data geographically (rather than on some attribute). For example here you subset the tornado tracks as line strings, keeping only those line strings that fall within the Kansas border defined by a polygon geometry.
First import the tornado data. Note that here you first check whether the tornado data file is already in your data directory using the if() conditional and the list.files() function. You download the file only if it is not (!) in the list.
if(!"1950-2020-torn-aspath" %in% list.files(here::here("data"))) {
download.file(url = "http://www.spc.noaa.gov/gis/svrgis/zipped/1950-2020-torn-aspath.zip",
destfile = here::here("data", "1950-2020-torn-aspath.zip"))
unzip(here::here("data", "1950-2020-torn-aspath.zip"),
exdir = here::here("data"))
}
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-aspath"),
                       layer = "1950-2020-torn-aspath")
## Reading layer `1950-2020-torn-aspath' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-aspath'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
The geometries are line strings representing the approximate track of each tornado. The CRS has an EPSG code of 4326, the same as the Kansas polygon.
To keep only the tornado tracks that fall within the border of Kansas you use the sf::st_intersection() function. The first argument (x =) is the simple feature data frame that you want to subset and the second argument (y =) defines the geometry over which the subset occurs.
KS_Torn.sf <- sf::st_intersection(x = Torn.sf,
                                 y = KS.sf)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
You can use the pipe operator (|>) to pass the first argument to the function.
KS_Torn.sf <- Torn.sf |>
  sf::st_intersection(y = KS.sf)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
You make a plot to see if things appear as you expect.
ggplot() +
geom_sf(data = KS.sf) +
geom_sf(data = KS_Torn.sf)
Note that no tornado track lies outside the state border. Line strings that lie outside the border are clipped at the border. However the attribute values represent the entire track.
If you want the entire tornado track for all tornadoes that passed into (or through) the state, then you first use the geometric binary predicate function sf::st_intersects(). With sparse = FALSE a matrix with a single column of TRUEs and FALSEs is returned. Here you use the piping operator to implicitly specify the x = argument as the Torn.sf data frame.
Intersects <- Torn.sf |>
sf::st_intersects(y = KS.sf, sparse = FALSE)
head(Intersects)## [,1]
## [1,] FALSE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE
sum(Intersects)## [1] 4377
Next you create a new data frame from the original data frame keeping only observations (rows) where Intersects is TRUE.
KS_Torn2.sf <- Torn.sf[Intersects, ]
ggplot() +
geom_sf(data = KS.sf) +
geom_sf(data = KS_Torn2.sf)
Are tornadoes more common in some parts of Kansas than others? One way to answer this question is to see how far away the tornado centroid is from the center of the state.
Start by computing the centers of the state polygon and the combined set of Kansas tornadoes using the sf::st_centroid() function. Note you first use the sf::st_combine() function on the tornadoes.
geocenterKS <- KS.sf |>
sf::st_centroid()## Warning in st_centroid.sf(KS.sf): st_centroid assumes attributes are constant
## over geometries of x
centerKStornadoes <- KS_Torn.sf |>
sf::st_combine() |>
sf::st_centroid()
Then make a map and compute the distance in meters using the sf::st_distance() function.
ggplot() +
geom_sf(data = KS.sf) +
geom_sf(data = geocenterKS, col = "blue") +
geom_sf(data = centerKStornadoes, col = "red")
geocenterKS |>
sf::st_distance(centerKStornadoes)## Units: [m]
## [,1]
## [1,] 2875.099
Less than 3 km!
More examples: https://www.jla-data.net/eng/spatial-aggregation/
Mutating data frames with joins
Combining data from different sources based on a shared variable is a common operation. The {dplyr} package has join functions that follow naming conventions used in database languages (like SQL).
Given two data frames labeled x and y, the join functions add columns from y to x, matching rows based on the function name.
- inner_join(): includes all rows in x and y
- left_join(): includes all rows in x
- full_join(): includes all rows in x or y
Join functions work the same on data frames and on simple feature data frames. The most common type of attribute join on spatial data takes a simple feature data frame as the first argument and adds columns to it from a data frame specified as the second argument.
For example, you combine data on coffee production with the spData::world simple feature data frame. Coffee production by country is in the data frame called spData::coffee_data.
dplyr::glimpse(spData::coffee_data)## Rows: 47
## Columns: 3
## $ name_long <chr> "Angola", "Bolivia", "Brazil", "Burundi", "Came…
## $ coffee_production_2016 <int> NA, 3, 3277, 37, 8, NA, 4, 1330, 28, 114, NA, 1…
## $ coffee_production_2017 <int> NA, 4, 2786, 38, 6, NA, 12, 1169, 32, 130, NA, …
It has 3 columns: name_long names major coffee-producing nations and coffee_production_2016 and coffee_production_2017 contain estimated values for coffee production in units of 60-kg bags per year.
First select only the name and GDP (per person) from the spData::world simple feature data frame.
( world.sf <- spData::world |>
dplyr::select(name_long, gdpPercap) )## Simple feature collection with 177 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 3
## name_long gdpPercap geom
## <chr> <dbl> <MULTIPOLYGON [°]>
## 1 Fiji 8222. (((-180 -16.55522, -179.9174 -16.50178, -179.7933…
## 2 Tanzania 2402. (((33.90371 -0.95, 31.86617 -1.02736, 30.76986 -1…
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 27.65643, -8.7948…
## 4 Canada 43079. (((-132.71 54.04001, -133.18 54.16998, -133.2397 …
## 5 United States 51922. (((-171.7317 63.78252, -171.7911 63.40585, -171.5…
## 6 Kazakhstan 23587. (((87.35997 49.21498, 86.82936 49.82667, 85.54127…
## 7 Uzbekistan 5371. (((55.96819 41.30864, 57.09639 41.32231, 56.93222…
## 8 Papua New Guinea 3709. (((141.0002 -2.600151, 141.0171 -5.859022, 141.03…
## 9 Indonesia 10003. (((104.37 -1.084843, 104.0108 -1.059212, 103.4376…
## 10 Argentina 18798. (((-68.63401 -52.63637, -68.63335 -54.8695, -67.5…
## # … with 167 more rows
The dplyr::left_join() function takes the data frame named by the argument x = and joins it to the data frame named by the argument y =.
( world_coffee.sf <- dplyr::left_join(x = world.sf,
y = spData::coffee_data) )## Joining, by = "name_long"
## Simple feature collection with 177 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.9 xmax: 180 ymax: 83.64513
## Geodetic CRS: WGS 84
## # A tibble: 177 × 5
## name_long gdpPercap geom coffee_producti…
## <chr> <dbl> <MULTIPOLYGON [°]> <int>
## 1 Fiji 8222. (((-180 -16.55522, -179.9174 -16… NA
## 2 Tanzania 2402. (((33.90371 -0.95, 31.86617 -1.0… 81
## 3 Western Sahara NA (((-8.66559 27.65643, -8.817828 … NA
## 4 Canada 43079. (((-132.71 54.04001, -133.18 54.… NA
## 5 United States 51922. (((-171.7317 63.78252, -171.7911… NA
## 6 Kazakhstan 23587. (((87.35997 49.21498, 86.82936 4… NA
## 7 Uzbekistan 5371. (((55.96819 41.30864, 57.09639 4… NA
## 8 Papua New Guinea 3709. (((141.0002 -2.600151, 141.0171 … 114
## 9 Indonesia 10003. (((104.37 -1.084843, 104.0108 -1… 742
## 10 Argentina 18798. (((-68.63401 -52.63637, -68.6333… NA
## # … with 167 more rows, and 1 more variable: coffee_production_2017 <int>
Because the two data frames share a common variable name (name_long) the join works without using the by = argument. The result is a simple feature data frame identical to the world.sf object but with two new variables indicating coffee production in 2016 and 2017.
names(world_coffee.sf)## [1] "name_long" "gdpPercap" "geom"
## [4] "coffee_production_2016" "coffee_production_2017"
For a join to work there must be at least one variable name in common.
Since the object listed in the x = argument is a simple feature data frame, the join function returns a simple feature data frame with the same number of rows (observations).
Although there are only 47 rows of data in spData::coffee_data, all 177 of the country records in world.sf are kept intact in world_coffee.sf. Rows in the first dataset with no match are assigned NA values for the new coffee production variables.
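When the key columns have different names in the two data frames, name the match explicitly with a named vector in the by = argument. A minimal sketch with made-up data frames (the column names here are assumptions for illustration):

```r
library(dplyr)

# Made-up data frames: the key column is named differently in each one.
x <- data.frame(name_long = c("Brazil", "Fiji"))
y <- data.frame(country = c("Brazil", "Kenya"), bags = c(3277, 60))

# Match x$name_long to y$country with a named by = vector.
dplyr::left_join(x, y, by = c("name_long" = "country"))
# Brazil gets bags = 3277; Fiji gets NA.
```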
If you want to keep only countries that have a match in the key variable then use dplyr::inner_join(). Here you use the piping operator to implicitly specify the x = argument as the world.sf data frame.
world.sf |>
dplyr::inner_join(spData::coffee_data)## Joining, by = "name_long"
## Simple feature collection with 45 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -117.1278 ymin: -33.76838 xmax: 156.02 ymax: 35.49401
## Geodetic CRS: WGS 84
## # A tibble: 45 × 5
## name_long gdpPercap geom coffee_producti…
## <chr> <dbl> <MULTIPOLYGON [°]> <int>
## 1 Tanzania 2402. (((33.90371 -0.95, 31.86617 -1… 81
## 2 Papua New Guinea 3709. (((141.0002 -2.600151, 141.017… 114
## 3 Indonesia 10003. (((104.37 -1.084843, 104.0108 … 742
## 4 Kenya 2753. (((39.20222 -4.67677, 39.60489… 60
## 5 Dominican Republic 12663. (((-71.7083 18.045, -71.65766 … 1
## 6 Timor-Leste 6263. (((124.9687 -8.89279, 125.07 -… 14
## 7 Mexico 16623. (((-117.1278 32.53534, -116.72… 151
## 8 Brazil 15374. (((-53.37366 -33.76838, -52.71… 3277
## 9 Bolivia 6325. (((-69.52968 -10.95173, -68.66… 3
## 10 Peru 11548. (((-69.89364 -4.298187, -70.39… 585
## # … with 35 more rows, and 1 more variable: coffee_production_2017 <int>
You can join in the other direction as well, starting with a regular data frame and adding variables from a simple features object.
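A sketch using the same coffee example (the object name coffee_world is my own): the result of the reverse join is a regular data frame. The geometry column comes along as a list column, but the sf class is dropped, so you restore it with sf::st_as_sf().

```r
library(dplyr)

# Data frame first, simple feature data frame second.
coffee_world <- spData::coffee_data |>
  dplyr::left_join(spData::world, by = "name_long")
class(coffee_world)   # a plain data frame, not an sf object

# The geom column is still there, so restoring the sf class is easy.
coffee_world.sf <- coffee_world |>
  sf::st_as_sf()
```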
More information on attribute data operations such as these is given here: https://geocompr.robinlovelace.net/attr.html
Interpolation using areal weights
Areal-weighted interpolation estimates the value of some variable from a set of polygons to an overlapping but incongruent set of target polygons. For example, suppose you want demographic information given at the Census tract level to be estimated within the tornado damage path. Damage paths do not align with census tract boundaries so areal weighted interpolation is needed to get demographic estimates at the tornado level.
The function sf::st_interpolate_aw() performs areal-weighted interpolation of polygon data. As an example, consider the number of births by county in North Carolina over the period 1970 through 1974 (BIR74).
The data are available as a shapefile as part of the {sf} package system file. Use the sf::st_read() function together with the system.file() function to import the data. Then create a map filling by the BIR74 variable.
nc.sf <- sf::st_read(system.file("shape/nc.shp",
package = "sf"))## Reading layer `nc' from data source
## `/Library/Frameworks/R.framework/Versions/4.2/Resources/library/sf/shape/nc.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS: NAD27
ggplot(data = nc.sf) +
geom_sf(mapping = aes(fill = BIR74))
Next construct a 20 by 10 grid of polygons that overlap the state using the sf::st_make_grid() function. The function takes the bounding box from the nc.sf simple feature data frame and constructs a two-dimension grid using the dimensions specified with the n = argument.
g.sfc <- sf::st_make_grid(nc.sf,
n = c(20, 10))
ggplot() +
geom_sf(data = g.sfc, col = "red") +
geom_sf(data = nc.sf, fill = "transparent")
The result is overlapping but incongruent sets of polygons as a sfc (simple feature column).
Then you use the sf::st_interpolate_aw() function with the first argument being the simple feature data frame containing the variable you want to aggregate and the to = argument set to the polygons over which you want the variable aggregated. The name of the variable must be put in quotes inside the subset operator []. With extensive = FALSE (the default) the variable is assumed to be spatially intensive (like population density) and the mean is preserved.
a1.sf <- sf::st_interpolate_aw(nc.sf["BIR74"],
to = g.sfc,
extensive = FALSE)## Warning in st_interpolate_aw.sf(nc.sf["BIR74"], to = g.sfc, extensive = FALSE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
The result is a simple feature data frame with the same polygon geometries as the sfc grid and a single variable (BIR74).
( p1 <- ggplot() +
geom_sf(data = a1.sf, mapping = aes(fill = BIR74)) +
scale_fill_continuous(limits = c(0, 18000)) +
labs(title = "Intensive") )
Note that the average number of births across the state at the county level matches (roughly) the average number of births across the grid of polygons, but the sums do not match.
mean(a1.sf$BIR74) / mean(nc.sf$BIR74)## [1] 1.040669
sum(a1.sf$BIR74) / sum(nc.sf$BIR74)## [1] 1.436123
An intensive variable is independent of the spatial units (e.g., population density, percentages); a variable that has been normalized in some fashion. An extensive variable depends on the spatial unit (e.g., population totals). Assuming a uniform population density, the number of people will depend on the size of the spatial area.
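A toy calculation (with assumed numbers) makes the distinction concrete. Splitting a county into two equal halves splits a count but leaves a density unchanged.

```r
area_weight <- 0.5            # each half covers half the county's area

# Extensive (a count): apportioned by area weight; the sum is preserved.
births <- 1000
births_half <- births * area_weight   # 500 in each half; 500 + 500 = 1000

# Intensive (a rate): carried over unchanged; the mean is preserved.
density <- 250                # births per km^2, assumed uniform
density_half <- density       # still 250 in each half
```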
Since the number of births in each county is an extensive variable, you change the extensive = argument to TRUE.
a2.sf <- sf::st_interpolate_aw(nc.sf["BIR74"],
to = g.sfc,
extensive = TRUE)## Warning in st_interpolate_aw.sf(nc.sf["BIR74"], to = g.sfc, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
( p2 <- ggplot(a2.sf) +
geom_sf(mapping = aes(fill = BIR74)) +
scale_fill_continuous(limits = c(0, 18000)) +
labs(title = "Extensive") )
In this case you preserve the total number of births across the domain. You verify this ‘mass preservation’ property (pycnophylactic property) with a ratio of one.
sum(a2.sf$BIR74) / sum(nc.sf$BIR74)## [1] 1
Here you create a plot of both interpolations.
library(patchwork)
p1 / p2
Example: tornado paths and housing units
Here you are interested in the number of houses (housing units) affected by tornadoes occurring in Florida 2014-2020. You begin by creating a polygon geometry for each tornado record.
Import the data, transform the native CRS to 3857 (pseudo-Mercator), and filter on yr (year) and st (state).
FL_Torn.sf <- Torn.sf |>
sf::st_transform(crs = 3857) |>
dplyr::filter(yr >= 2014,
st == "FL")
Next change the geometries from line strings to polygons to represent the tornado path ('footprint'). The path width is given by the variable labeled wid (in yards). First you create a new variable with the width in units of meters and then use the sf::st_buffer() function with the dist = argument set to 1/2 the width.
FL_Torn.sf <- FL_Torn.sf |>
dplyr::mutate(Width = wid * .9144)
FL_TornPath.sf <- FL_Torn.sf |>
sf::st_buffer(dist = FL_Torn.sf$Width / 2)
To see the change from line string track to polygon path, plot both together for one of the tornadoes.
ggplot() +
geom_sf(data = FL_TornPath.sf[10, ]) +
geom_sf(data = FL_Torn.sf[10, ], col = "red")
Now you want the number of houses within the path. The housing units are from the census data. You can access these data with the tidycensus::get_acs() function. The {tidycensus} package is an interface to the decennial US Census and American Community Survey APIs and the US Census Bureau’s geographic boundary files. Functions return Census and ACS data as simple feature data frames for all Census geographies.
Note: You need to get an API key from U.S. Census. Then
file.create("CensusAPI") # open then copy/paste your API key
To ensure the file is readable only by you, and not by any other user on the system, use the Sys.chmod() function. Then read the key and install it.
Sys.chmod("CensusAPI", mode = "0400")
key <- readr::read_file("CensusAPI")
tidycensus::census_api_key(key, install = TRUE, overwrite = TRUE)
readRenviron("~/.Renviron")
Make sure the file is listed in your .gitignore so the key doesn't get included in a public git repository.
The geography is the tract level and the variable is the unweighted sample count of housing units (B00002_001). Transform the CRS to match that of the tornado paths.
Census.sf <- tidycensus::get_acs(geography = "tract",
variables = "B00002_001",
state = "FL",
year = 2015,
geometry = TRUE) |>
sf::st_transform(crs = sf::st_crs(FL_TornPath.sf))## Getting data from the 2011-2015 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
head(Census.sf)## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -9527976 ymin: 3452498 xmax: -9168692 ymax: 3522281
## Projected CRS: WGS 84 / Pseudo-Mercator
## GEOID NAME variable estimate
## 1 12001001201 Census Tract 12.01, Alachua County, Florida B00002_001 112
## 2 12001001519 Census Tract 15.19, Alachua County, Florida B00002_001 99
## 3 12001001520 Census Tract 15.20, Alachua County, Florida B00002_001 85
## 4 12001002207 Census Tract 22.07, Alachua County, Florida B00002_001 137
## 5 12001002218 Census Tract 22.18, Alachua County, Florida B00002_001 111
## 6 12005000805 Census Tract 8.05, Bay County, Florida B00002_001 159
## geometry
## 1 MULTIPOLYGON (((-9171497 34...
## 2 MULTIPOLYGON (((-9171172 34...
## 3 MULTIPOLYGON (((-9171771 34...
## 4 MULTIPOLYGON (((-9177078 34...
## 5 MULTIPOLYGON (((-9175225 34...
## 6 MULTIPOLYGON (((-9527976 35...
The column labeled estimate is the estimate of the number of housing units within the census tract.
Finally you use the sf::st_interpolate_aw() function to spatially interpolate the housing units to the tornado path.
awi.sf <- sf::st_interpolate_aw(Census.sf["estimate"],
to = FL_TornPath.sf,
extensive = TRUE)## Warning in st_interpolate_aw.sf(Census.sf["estimate"], to = FL_TornPath.sf, :
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
head(awi.sf)## Simple feature collection with 6 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -9502241 ymin: 3489409 xmax: -9452129 ymax: 3606099
## Projected CRS: WGS 84 / Pseudo-Mercator
## estimate geometry
## 1 1.801417e-02 POLYGON ((-9481985 3545304,...
## 2 3.493061e-05 POLYGON ((-9452129 3489418,...
## 3 3.493061e-05 POLYGON ((-9452129 3504835,...
## 4 6.396930e-03 POLYGON ((-9459939 3509978,...
## 5 2.001489e-05 POLYGON ((-9452129 3521558,...
## 6 5.666174e-02 POLYGON ((-9499599 3606097,...
range(awi.sf$estimate,
na.rm = TRUE)## [1] 0.0000 175.6654
The tornado that hit the most houses occurred just east of downtown Orlando.
awi.sf2 <- awi.sf |>
dplyr::filter(estimate > 175)
tmap::tmap_mode("view")## tmap mode set to interactive viewing
tmap::tm_shape(awi.sf2) +
tmap::tm_borders()
Thursday September 15, 2022
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” – Bill Gates
Today
- S4 spatial data objects
- Working with raster data
S4 spatial data objects
The {sp} package has methods for working with spatial data as S4 classes. A few of the packages we will use this semester for analyzing/modeling spatial data work only with {sp} objects, so it is helpful to see how they are structured.
Install and load the package.
if(!require(sp)) install.packages(pkgs = "sp", repos = "http://cran.us.r-project.org")## Loading required package: sp
library(sp)
Spatial objects from the {sp} package fall into two types:
- spatial-only information (the geometry), including SpatialPoints, SpatialLines, SpatialPolygons, etc., and
- extensions to these types where attribute information is stored in a data frame, including SpatialPointsDataFrame, SpatialLinesDataFrame, etc.
The typical situation is that you have a simple feature data frame (an S3 spatial object) and you need to convert it to an {sp} spatial data frame before the data can be analyzed or modeled.
Consider again the tornado tracks that you import as a simple feature data frame.
FL_Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-aspath"),
layer = "1950-2020-torn-aspath") |>
dplyr::filter(st == "FL")## Reading layer `1950-2020-torn-aspath' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-aspath'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
class(FL_Torn.sf)## [1] "sf" "data.frame"
The object FL_Torn.sf is a simple feature data frame (S3 spatial data object). You convert the simple feature data frame to an S4 spatial data object using the sf::as_Spatial() function.
FL_Torn.sp <- FL_Torn.sf |>
sf::as_Spatial()
class(FL_Torn.sp)## [1] "SpatialLinesDataFrame"
## attr(,"package")
## [1] "sp"
The file FL_Torn.sp is a spatial object of class SpatialLinesDataFrame.
Information in S4 spatial objects is stored in slots. Slot names are listed with the slotNames() function.
FL_Torn.sp |>
slotNames()## [1] "data" "lines" "bbox" "proj4string"
The data slot contains the data frame (attribute table), the lines slot contains the spatial geometries (in this case lines), the bbox slot is the boundary box and the proj4string slot is the CRS.
The object name followed by the @ symbol allows access to information in the slot. The @ symbol is similar to the $ symbol for regular data frames. For example to see the first three rows of the data frame type
FL_Torn.sp@data[1:3, ]## om yr mo dy date time tz st stf stn mag inj fat loss closs slat
## 1 29 1950 3 16 1950-03-16 09:15:00 3 FL 12 1 2 0 0 3 0 29.65
## 2 105 1950 5 15 1950-05-15 11:00:00 3 FL 12 3 1 0 0 4 0 28.58
## 3 106 1950 5 15 1950-05-15 11:00:00 3 FL 12 4 2 0 0 4 0 28.50
## slon elat elon len wid fc
## 1 -81.22 29.6501 -81.2199 1.5 150 0
## 2 -81.37 28.5801 -81.3699 0.1 10 0
## 3 -81.37 28.5001 -81.3699 0.1 10 0
You recognize this as information about the first three tornadoes in the record. In fact, the object name together with the slot name data has class data.frame.
class(FL_Torn.sp@data)## [1] "data.frame"
When using the $ symbol on S4 spatial objects, you access the columns as you would a data frame. For example, to list the EF rating (column labeled mag) of the first 3 tornadoes.
FL_Torn.sp$mag[1:3]## [1] 2 1 2
Selecting, retrieving, or replacing attributes in S4 spatial data frames is done with methods from the {base} R package. For example, [] is used to select rows and/or columns. To select the mag of the 7th tornado type
FL_Torn.sp$mag[7]## [1] 1
Other methods include: plot(), summary(), dim() and names() (operate on the data slot), as.data.frame(), as.matrix() and image() (for spatial data on a grid), and length() (number of features).
You can’t use the {dplyr} verbs on S4 data frames. To convert from an S4 spatial data frame to a simple feature data frame use sf::st_as_sf().
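A minimal sketch of the conversion back, using the FL_Torn.sp object created above (the filter on mag is an arbitrary illustration):

```r
# Convert the S4 spatial data frame back to a simple feature data frame.
FL_Torn.sf2 <- FL_Torn.sp |>
  sf::st_as_sf()
class(FL_Torn.sf2)
## [1] "sf"         "data.frame"

# The {dplyr} verbs work again, e.g., keep only the violent tornadoes.
FL_Torn.sf2 |>
  dplyr::filter(mag >= 3)
```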
The first spatial geometry is given as the first element of the lines list.
FL_Torn.sp@lines[1]## [[1]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -81.2200 29.6500
## [2,] -81.2199 29.6501
##
##
##
## Slot "ID":
## [1] "1"
It is an object of class Lines. The line is identified by a matrix indicating the longitude and latitude of the start point in row one and the longitude and latitude of the end point in row two.
The bbox slot is an object of class matrix and array and the proj4string slot is of class CRS.
The interface to the Geometry Engine, Open Source (GEOS) is through the {rgeos} package.
Working with raster data
The raster data model divides geographic space into a grid of cells of constant size (resolution), and we use classes from the {terra} package to work with raster data.
A raster is a data structure that divides space into rectangles called ‘cells’ (or ‘pixels’). Each cell has an attribute value.
The {terra} package has functions for creating, reading, manipulating, and writing raster data as objects of the S4 classes SpatRaster and SpatVector.
To see what methods (functions) for class SpatRaster are available use the methods() function.
methods(class = "SpatRaster")## [1] [ [[ [[<- [<- $
## [6] $<- Arith as.data.frame as.list as.matrix
## [11] coerce Compare Logic Math Math2
## [16] merge plot show split spplot
## [21] summary Summary
## see '?methods' for accessing help and source code
The list includes {base} R and {sf} methods.
The terra::rast() function creates a raster with a geographic (longitude/latitude) CRS and a 1 by 1 degree grid of cells across the globe.
r <- terra::rast()
r## class : SpatRaster
## dimensions : 180, 360, 1 (nrow, ncol, nlyr)
## resolution : 1, 1 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84
Arguments including xmin, nrows, ncols, and crs are used to change these default settings.
The object has class SpatRaster with geographic coordinates spanning the globe at one-degree resolution in the north-south and the east-west directions.
To create a raster with 36 columns between -100 and 0 degrees East longitude and 18 rows between the equator and 50 degrees N latitude, specify the number of columns, the number of rows, and the extent as follows.
r <- terra::rast(ncols = 36, nrows = 18,
xmin = -100, xmax = 0,
ymin = 0, ymax = 50)
r## class : SpatRaster
## dimensions : 18, 36, 1 (nrow, ncol, nlyr)
## resolution : 2.777778, 2.777778 (x, y)
## extent : -100, 0, 0, 50 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84
terra::res(r)## [1] 2.777778 2.777778
This results in a raster with a cell resolution of about 2.78 degrees of longitude and 2.78 degrees of latitude.
The structure of the raster can be changed after it is created. Here you change the resolution to 3 degrees, which induces changes to the number of rows and columns.
terra::res(r) <- 3
ncol(r)## [1] 33
nrow(r)## [1] 17
A SpatRaster object created this way is a template with no values assigned to the cells, and by default it has an extent that spans the globe.
r <- terra::rast(ncol = 10, nrow = 10)
terra::ncell(r)## [1] 100
terra::hasValues(r)## [1] FALSE
Here there are 100 cells in a 10 by 10 arrangement with no values in any of the cells.
The terra::values() function is used to place values in the cells. The function is specified on the left-hand side of the assignment operator. First you assign to a vector of length terra::ncell(r) random numbers from a uniform distribution with the runif() function. The default is that the random numbers are between 0 and 1.
v <- runif(terra::ncell(r))
head(v)## [1] 0.35177504 0.16728355 0.93080046 0.38447031 0.61471771 0.02374155
terra::values(r) <- v
head(r)## class : SpatRaster
## dimensions : 6, 10, 1 (nrow, ncol, nlyr)
## resolution : 36, 18 (x, y)
## extent : -180, 180, -18, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84
## source : memory
## name : lyr.1
## min value : 0.0008497667
## max value : 0.9960303
The cells are arranged in lexicographical order (upper left to lower right) and the cells are populated with values from the vector in this order.
The terra::plot() function creates a choropleth map of the values in cells.
terra::plot(r)
The default CRS is geographic.
terra::crs(r)## [1] "GEOGCRS[\"WGS 84\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ID[\"EPSG\",6326]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8901]],\n CS[ellipsoidal,2],\n AXIS[\"longitude\",east,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]],\n AXIS[\"latitude\",north,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433,\n ID[\"EPSG\",9122]]]]"
To re-project the raster use the function terra::project().
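For example (a sketch; the Robinson projection is an arbitrary choice of target CRS), you pass the raster and a CRS string:

```r
# Re-project the one-degree global raster to the Robinson projection.
# Cell values are resampled (bilinear interpolation by default for
# continuous data), so the new cells will not match the old ones exactly.
r_robin <- terra::project(r, "+proj=robin")
terra::plot(r_robin)
```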
Here you create a new raster with cell numbers as values using the terra::setValues() function to place the numbers in the cells.
r <- terra::rast(xmin = -110, xmax = -90,
ymin = 40, ymax = 60,
ncols = 10, nrows = 10)
r <- terra::setValues(r, 1:terra::ncell(r))
terra::plot(r)
The values increase starting from top left to bottom right as dictated by the sequence 1:terra::ncell(r) and the lexicographic order in which the raster grids are filled.
The terra::rast() function imports raster data via GDAL. Supported formats include GeoTIFF, ESRI, ENVI, and ERDAS. Most formats that can be imported can also be used to export a raster.
Consider the Meuse dataset, using a file (shipped with the {raster} package) in the native 'raster' file format.
f <- system.file("external/test.grd",
package = "raster")
r <- terra::rast(f)
Do the cells contain values? Is the raster stored in memory? Create a plot.
terra::hasValues(r)## [1] TRUE
terra::inMemory(r)## [1] FALSE
terra::plot(r, main = "Raster layer from file")
Note the raster is a set of cells arranged in a rectangular array. Values that are coded as NA are not plotted.
SpatRaster objects can have more than one raster. These are called layers.
r## class : SpatRaster
## dimensions : 115, 80, 1 (nrow, ncol, nlyr)
## resolution : 40, 40 (x, y)
## extent : 178400, 181600, 329400, 334000 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=sterea +lat_0=52.1561605555556 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +datum=WGS84 +units=m +no_defs
## source : test.grd
## name : test
## min value : 138.7071
## max value : 1736.058
The dimensions are nrow = 115 by ncol = 80 and nlyr = 1.
You can add layers to the object. Here you create three rasters and assign random values to the cells.
r1 <- terra::rast(nrow = 10, ncol = 10)
terra::values(r1) <- runif(terra::ncell(r1))
r2 <- terra::rast(nrow = 10, ncol = 10)
terra::values(r2) <- runif(terra::ncell(r2))
r3 <- terra::rast(nrow = 10, ncol = 10)
terra::values(r3) <- runif(terra::ncell(r3))
You combine the rasters into a single SpatRaster object with the concatenate function c().
s <- c(r1, r2, r3)
s## class : SpatRaster
## dimensions : 10, 10, 3 (nrow, ncol, nlyr)
## resolution : 36, 18 (x, y)
## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84
## sources : memory
## memory
## memory
## names : lyr.1, lyr.1, lyr.1
## min values : 0.0059893031, 0.0004921041, 0.0040800548
## max values : 0.9723254, 0.9994648, 0.9924234
dim(s)## [1] 10 10 3
terra::nlyr(s)## [1] 3
terra::plot(s)
Each raster is a separate layer.
Here you import a set of raster layers from a file.
f <- system.file("external/rlogo.grd",
package = "raster")
b <- terra::rast(f)
b## class : SpatRaster
## dimensions : 77, 101, 3 (nrow, ncol, nlyr)
## resolution : 1, 1 (x, y)
## extent : 0, 101, 0, 77 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
## source : rlogo.grd
## colors RGB : 1, 2, 3
## names : red, green, blue
## min values : 0, 0, 0
## max values : 255, 255, 255
terra::plot(b)
Most {base} R functions (+, *, round(), ceiling(), log(), etc.) work on raster objects. Operations are applied to all cells at once.
Here you place the numbers from 1 to 100 sequentially in the cells, then add 100 to these values and take the square root.
r <- terra::rast(ncol = 10, nrow = 10)
terra::values(r) <- 1:terra::ncell(r)
s <- r + 100
s <- sqrt(s)
terra::plot(s)
Here you replace the cell values with random uniform numbers between 0 and 1. Then round to the nearest integer and add one.
r <- terra::rast(ncol = 10, nrow = 10)
terra::values(r) <- runif(terra::ncell(r))
r <- round(r)
r <- r + 1
terra::plot(r)
Replace only certain values with the subset operator [].
r <- terra::rast(xmin = -90, xmax = 90, ymin = -30, ymax = 30)
terra::values(r) <- rnorm(terra::ncell(r))
terra::plot(r)
r[r > 2] <- 0
terra::plot(r)
Functions for manipulating a raster
The terra::crop() function takes a geographic subset of a larger raster object. A raster is cropped by providing an extent object or another spatial object from which an extent can be extracted (including objects of classes from the {raster} and {sp} packages).
The terra::trim() function crops a raster layer by removing the outer rows and columns that only contain NA values. The terra::extend() function adds new rows and/or columns with NA values.
The terra::merge() function combines two or more rasters into a single raster. The input objects must have the same resolution and origin (such that their cells fit into a single larger raster). If this is not the case, first adjust one of the objects with the functions aggregate() or resample().
The terra::aggregate() and terra::disagg() functions change the resolution (cell size) of a raster object.
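To see what aggregation does, here is a base R sketch (not the {terra} implementation) that reduces a 4 x 4 matrix to 2 x 2 by averaging non-overlapping 2 x 2 blocks, which is what terra::aggregate() with fact = 2 and fun = mean computes:

```r
# Aggregate a matrix by a factor of 2 using block means,
# mimicking terra::aggregate(r, fact = 2, fun = mean).
block_mean <- function(m, fact = 2) {
  nr <- nrow(m) / fact
  nc <- ncol(m) / fact
  out <- matrix(NA_real_, nr, nc)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      rows <- ((i - 1) * fact + 1):(i * fact)
      cols <- ((j - 1) * fact + 1):(j * fact)
      out[i, j] <- mean(m[rows, cols])
    }
  }
  out
}

m <- matrix(1:16, nrow = 4, ncol = 4)
block_mean(m)
##      [,1] [,2]
## [1,]  3.5 11.5
## [2,]  5.5 13.5
```

Disaggregation goes the other way: each coarse cell is repeated (or interpolated) into fact^2 finer cells.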
As a simple example showing some of this functionality here you crop the raster into two pieces and then merge the two pieces into one. The terra::merge() function has an argument that allows you to export to a file (here test.grd).
r1 <- terra::crop(r, terra::ext(-180, 0, 0, 30))
r2 <- terra::crop(r, terra::ext(-10, 180, -20, 10))
m <- terra::merge(r1, r2,
filename = here::here('outputs', 'test.grd'),
overwrite = TRUE)
terra::plot(m)
The terra::flip() function flips the data (reverses the order) in the horizontal or vertical direction. The terra::rotate() function rotates a raster that has longitudes from 0 to 360 degrees (often used by climatologists) to the standard -180 to 180 degrees system.
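The coordinate arithmetic underlying that rotation is simple; here is a base R sketch (not the {terra} implementation) converting 0-360 longitudes to the -180-180 system:

```r
# Convert longitudes from the 0..360 convention to -180..180.
rotate_lon <- function(lon) ((lon + 180) %% 360) - 180

rotate_lon(c(10, 170, 190, 350))
## [1]   10  170 -170  -10
```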
You extract values from a raster for a set of locations with the terra::extract() function. The locations can be a vector object (points, lines, polygons), a matrix with (x, y) or (longitude, latitude – in that order!) coordinates, or a vector with cell numbers.
r <- terra::rast(ncols = 5, nrows = 5,
xmin = 0, xmax = 5,
ymin = 0, ymax = 5)
terra::values(r) <- 1:25
xy <- rbind(c(.5, .5), c(2.5, 2.5))
p <- terra::vect(xy, crs="+proj=longlat +datum=WGS84")
terra::extract(r, xy)## lyr.1
## 1 21
## 2 13
terra::extract(r, p)## ID lyr.1
## 1 1 21
## 2 2 13
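The values 21 and 13 follow from how cells are numbered: row-wise starting at the top-left corner. Here is a base R sketch of that bookkeeping for the 5 x 5 raster above:

```r
# Map (x, y) coordinates to a cell number for a raster with known
# extent and dimensions. Cells are numbered row-wise from the
# top-left corner, so rows count downward from ymax.
cell_from_xy <- function(x, y, xmin = 0, ymax = 5, res = 1,
                         ncol = 5, nrow = 5) {
  col <- floor((x - xmin) / res) + 1
  row <- floor((ymax - y) / res) + 1
  (row - 1) * ncol + col
}

cell_from_xy(c(0.5, 2.5), c(0.5, 2.5))
## [1] 21 13
```

Point (.5, .5) sits in the bottom-left cell (row 5, column 1), which is cell 21; point (2.5, 2.5) sits in the center cell (row 3, column 3), which is cell 13.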
To convert the values of a raster layer to points or polygons you use as.points() and as.polygons(). These functions return a SpatVector object for the cells that do not have missing values.
Vector data are converted to a raster with the terra::rasterize() function. Polygon-to-raster conversion is often done to create a mask (i.e., to set a group of cells of a raster object to NA) or to summarize values on a raster by zone. For example, a country polygon is converted to a raster that is used to set all the cells outside that country to NA. Polygons representing administrative regions such as states can be converted to a raster to summarize values by region. Point-to-raster conversion is often done to analyze location data (e.g., the locations of a specific species of tree in a forest).
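At its core, counting points per grid cell (what the tornado example below does with terra::rasterize()) amounts to binning coordinates. A base R sketch of that idea, independent of any spatial package:

```r
# Count points per grid cell by binning coordinates,
# the essence of terra::rasterize(..., fun = "length").
count_points <- function(x, y, xbreaks, ybreaks) {
  cx <- cut(x, xbreaks, include.lowest = TRUE)
  cy <- cut(y, ybreaks, include.lowest = TRUE)
  table(cy, cx)
}

set.seed(1)
x <- runif(100, 0, 4)
y <- runif(100, 0, 4)
counts <- count_points(x, y, xbreaks = 0:4, ybreaks = 0:4)
sum(counts)  # 100: every point falls in exactly one cell
```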
Example: the number of tornadoes passing through each grid cell
Here you want a latitude/longitude grid (1/2 degree latitude by 1/2 degree longitude) with each cell in the grid containing the number of tornadoes that went through it since 2003.
First import the tornado (initial track point) data as a simple feature data frame.
if(!"1950-2020-torn-initpoint" %in% list.files(here::here("data"))) {
download.file(url = "http://www.spc.noaa.gov/gis/svrgis/zipped/1950-2020-torn-initpoint.zip",
destfile = here::here("data", "1950-2020-torn-initpoint.zip"))
unzip(here::here("data", "1950-2020-torn-initpoint.zip"),
exdir = here::here("data"))
}
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint"),
layer = "1950-2020-torn-initpoint") |>
dplyr::filter(yr >= 2003)## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
Note the extent of the bounding box and check the native CRS.
sf::st_crs(Torn.sf)## Coordinate Reference System:
## User input: WGS 84
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]
The CRS is geographic.
Next create a raster (called frame) with a domain that covers the area of interest and assign a resolution of one-half degree in both longitude and latitude. Check the extent of the raster with the terra::ext() function.
frame <- terra::rast(xmin = -106, xmax = -67,
ymin = 24, ymax = 50)
terra::res(frame) <- .5
terra::ext(frame)## SpatExtent : -106, -67, 24, 50 (xmin, xmax, ymin, ymax)
Next use the terra::rasterize() function to count the number of times each raster cell contains a tornado. The first argument is the vector spatial data and the second is the raster without values. The field = argument specifies a column name in the spatial data frame (here just an identifier) and the fun = argument specifies what to do; here you count the unique instances of the field in each cell by setting fun = "length". Raster cells without tornadoes are given a value of 0 via the background = argument.
Torn.v <- terra::vect(Torn.sf)
Torn.r <- terra::rasterize(x = Torn.v,
y = frame,
field = "om",
fun = "length",
background = 0)
class(Torn.r)## [1] "SpatRaster"
## attr(,"package")
## [1] "terra"
dim(Torn.r)## [1] 52 78 1
The result is a raster layer. The values are the number of tornadoes occurring in each cell.
Print the first 200 values (cells are ordered row-wise from the top-left).
terra::values(Torn.r)[1:200]## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [76] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [101] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [126] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [151] 0 0 0 0 0 0 0 1 1 4 2 3 4 6 5 1 0 4 6 13 8 7 20 18 12
## [176] 5 6 5 4 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
To visualize the raster use the plot() method.
terra::plot(Torn.r)
You can recognize the broad shape of the eastern two-thirds of the United States. Cells across the Plains and the South contain quite a few tornadoes, while cells over the Appalachian Mountains contain very few.
Clustering
Indeed, tornado activity appears in distinct clusters (groups). A statistic that estimates the amount of clustering is Moran's I. It is a global measure of clustering: values are high when high values tend to be near other high values and low values near other low values.
Values of Moran's I range from -1 to +1, where positive values indicate clustering and negative values indicate regularity (e.g., a chessboard pattern). It is implemented on a raster with the raster::Moran() function.
The function works only with S4 raster objects, so you need to first convert Torn.r from a SpatRaster to a RasterLayer. You do this with the raster() function after loading the {raster} package.
library(raster)
Torn.r2 <- raster(Torn.r)
class(Torn.r2)## [1] "RasterLayer"
## attr(,"package")
## [1] "raster"
str(Torn.r2)## Formal class 'RasterLayer' [package "raster"] with 12 slots
## ..@ file :Formal class '.RasterFile' [package "raster"] with 13 slots
## .. .. ..@ name : chr ""
## .. .. ..@ datanotation: chr "FLT4S"
## .. .. ..@ byteorder : chr "little"
## .. .. ..@ nodatavalue : num -Inf
## .. .. ..@ NAchanged : logi FALSE
## .. .. ..@ nbands : int 1
## .. .. ..@ bandorder : chr "BIL"
## .. .. ..@ offset : int 0
## .. .. ..@ toptobottom : logi TRUE
## .. .. ..@ blockrows : int 0
## .. .. ..@ blockcols : int 0
## .. .. ..@ driver : chr ""
## .. .. ..@ open : logi FALSE
## ..@ data :Formal class '.SingleLayerData' [package "raster"] with 13 slots
## .. .. ..@ values : num [1:4056] 0 0 0 0 0 0 0 0 0 0 ...
## .. .. ..@ offset : num 0
## .. .. ..@ gain : num 1
## .. .. ..@ inmemory : logi TRUE
## .. .. ..@ fromdisk : logi FALSE
## .. .. ..@ isfactor : logi FALSE
## .. .. ..@ attributes: list()
## .. .. ..@ haveminmax: logi TRUE
## .. .. ..@ min : num 0
## .. .. ..@ max : num 57
## .. .. ..@ band : int 1
## .. .. ..@ unit : chr ""
## .. .. ..@ names : chr "lyr.1"
## ..@ legend :Formal class '.RasterLegend' [package "raster"] with 5 slots
## .. .. ..@ type : chr(0)
## .. .. ..@ values : logi(0)
## .. .. ..@ color : logi(0)
## .. .. ..@ names : logi(0)
## .. .. ..@ colortable: logi(0)
## ..@ title : chr(0)
## ..@ extent :Formal class 'Extent' [package "raster"] with 4 slots
## .. .. ..@ xmin: num -106
## .. .. ..@ xmax: num -67
## .. .. ..@ ymin: num 24
## .. .. ..@ ymax: num 50
## ..@ rotated : logi FALSE
## ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
## .. .. ..@ geotrans: num(0)
## .. .. ..@ transfun:function ()
## ..@ ncols : int 78
## ..@ nrows : int 52
## ..@ crs :Formal class 'CRS' [package "sp"] with 1 slot
## .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +no_defs"
## .. .. ..$ comment: chr "GEOGCRS[\"WGS 84\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223"| __truncated__
## ..@ history : list()
## ..@ z : list()
The object Torn.r2 is a RasterLayer as an S4 data class. Note the use of slots for storing the information.
You can use the raster::Moran() function on the RasterLayer object.
raster::Moran(Torn.r2)## [1] 0.7524014
The value of .75 indicates a high level of tornado clustering at this scale.
Under the null hypothesis of no spatial autocorrelation the expected value for Moran’s I is close to zero [-1/(n-1), where n is the number of cells].
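To demystify the statistic, here is a base R sketch of Moran's I (not the {raster} implementation) using rook contiguity (shared-edge neighbors) with binary weights. On a chessboard pattern it returns exactly -1 (perfect regularity); on a smooth gradient it is positive (clustering):

```r
# Moran's I for a matrix using rook (shared-edge) neighbors
# and binary weights: I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
# where z are deviations from the mean and S0 is the sum of weights.
moran_I <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  z <- as.vector(m) - mean(m)        # column-major deviations
  cell <- function(i, j) (j - 1) * nr + i
  num <- 0; S0 <- 0
  for (j in seq_len(nc)) {
    for (i in seq_len(nr)) {
      for (d in list(c(1, 0), c(-1, 0), c(0, 1), c(0, -1))) {
        ii <- i + d[1]; jj <- j + d[2]
        if (ii >= 1 && ii <= nr && jj >= 1 && jj <= nc) {
          num <- num + z[cell(i, j)] * z[cell(ii, jj)]
          S0 <- S0 + 1
        }
      }
    }
  }
  (length(z) / S0) * num / sum(z^2)
}

chess <- outer(1:4, 1:4, function(i, j) (i + j) %% 2)  # chessboard
moran_I(chess)     # exactly -1: perfect negative autocorrelation
gradient <- outer(1:4, 1:4, "+")                       # smooth surface
moran_I(gradient)  # positive: similar values cluster together
```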
Clusters at a local level can be found using a local indicator of spatial autocorrelation. One such indicator is local Moran’s I, which is computed at each cell (using the MoranLocal() function) so the result is a raster.
Torn_lmi.r <- raster::MoranLocal(Torn.r2)
plot(Torn_lmi.r)
This type of plot makes it easy to identify the hot spots of tornado activity over parts of the South and the Central Plains.
To convert the local Moran raster to an S4 spatial data frame with polygon geometries use the rasterToPolygons() function.
Torn_lmi.sp <- raster::rasterToPolygons(Torn_lmi.r)
class(Torn_lmi.sp)## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Then convert the SpatialPolygonsDataFrame to a simple features data frame and make a plot.
Torn_lmi.sf <- sf::st_as_sf(Torn_lmi.sp)
library(ggplot2)
ggplot(data = Torn_lmi.sf) +
geom_sf(mapping = aes(fill = layer, color = layer))
Or using functions from the {tmap} package you map the raster layer directly.
tmap::tmap_mode("view")## tmap mode set to interactive viewing
tmap::tm_shape(Torn_lmi.r) +
tmap::tm_raster(alpha = .7)## Linking to GEOS 3.10.2, GDAL 3.4.2, PROJ 8.2.1; sf_use_s2() is TRUE
## Variable(s) "NA" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Focal (neighborhood) functions
The function terra::focal() computes statistics in a neighborhood of cells around a focal cell, putting the result in the focal cell of an output raster. The terra::distance() function computes the shortest distance to cells that are not NA. The terra::direction() function computes the direction towards (or from) the nearest cell that is not NA. The terra::adjacent() function determines which cells are adjacent to other cells.
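The idea behind terra::focal() can be sketched in base R as a moving-window calculation: each output cell gets a statistic of the input cells in a window centered on it. A minimal sketch with a 3 x 3 mean (edge cells are left as NA here; {terra} offers several edge-handling options):

```r
# A 3x3 moving-window (focal) mean over a matrix,
# mimicking terra::focal(r, w = 3, fun = mean).
focal_mean <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  out <- matrix(NA_real_, nr, nc)
  for (i in 2:(nr - 1)) {
    for (j in 2:(nc - 1)) {
      out[i, j] <- mean(m[(i - 1):(i + 1), (j - 1):(j + 1)])
    }
  }
  out
}

m <- matrix(1:25, nrow = 5, ncol = 5)
focal_mean(m)[3, 3]  # 13: the mean of the 3x3 block centered on cell (3, 3)
```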
Functions from the {raster} package require data objects to be of S4 classes. S4 classes allow rich data representations at the expense of flexibility. S3 class objects are more flexible, easier to maintain, and work with newer dialects (e.g., {dplyr}, {ggplot2}). Most packages on CRAN use S3 classes.
Consider a multi-band image taken from a Landsat 7 view of a small part of the Brazilian coast. It is included in the {stars} package and stored as a GeoTIFF file labeled L7_ETMs.tif. You import the image as a raster stack.
if(!require(stars)) install.packages("stars", repos = "http://cran.us.r-project.org")## Loading required package: stars
## Loading required package: abind
library(stars)
f <- system.file("tif/L7_ETMs.tif",
package = "stars")
library(raster)
L7.rs <- stack(f)
class(L7.rs)## [1] "RasterStack"
## attr(,"package")
## [1] "raster"
The data object L7.rs is a RasterStack, an S4 class.
You extract the extent and CRS slots using the @ syntax.
L7.rs@extent## class : Extent
## xmin : 288776.3
## xmax : 298722.8
## ymin : 9110729
## ymax : 9120761
L7.rs@crs## Coordinate Reference System:
## Deprecated Proj.4 representation:
## +proj=utm +zone=25 +south +ellps=GRS80 +units=m +no_defs
## WKT2 2019 representation:
## PROJCRS["SIRGAS 2000 / UTM zone 25S",
## BASEGEOGCRS["SIRGAS 2000",
## DATUM["Sistema de Referencia Geocentrico para las AmericaS 2000",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4674]],
## CONVERSION["UTM zone 25S",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-33,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",10000000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["Brazil - between 36°W and 30°W, northern and southern hemispheres, onshore and offshore."],
## BBOX[-23.8,-36,4.19,-29.99]],
## ID["EPSG",31985]]
You extract a single band (layer) from the stack with the layer = argument in the raster() function. You then plot the raster values with the plot() method and compute the spatial autocorrelation with the raster::Moran() function.
L7.rB3 <- raster(L7.rs, layer = 3)
plot(L7.rB3)
raster::Moran(L7.rB3)## [1] 0.8131887
You convert the raster to an ordinary (S3) data frame with the as.data.frame() method. Here you do that and then compute the normalized difference vegetation index (NDVI) from the columns L7_ETMs.4 and L7_ETMs.3 using the mutate() function from the {dplyr} package.
NDVI indicates live green vegetation from satellite images. Higher values indicate more green vegetation, negative values indicate water.
L7.df <- as.data.frame(L7.rs) |>
dplyr::mutate(NDVI = (L7_ETMs.4 - L7_ETMs.3)/(L7_ETMs.4 + L7_ETMs.3))
More examples and other functions for working with raster data using the {terra} package are illustrated at https://geocompr.robinlovelace.net/raster-vector.html. I encourage you to take a look.
Tuesday September 20, 2022
“Maps invest information with meaning by translating it into visual form.” – Susan Schulten
Today
- Working with space-time data
- Making maps
Working with space-time data
Space-time data arrive in the form of multi-dimensional arrays. Examples include:
- raster images
- socio-economic or demographic data
- environmental variables monitored at fixed stations
- time series of satellite images with multiple spectral bands
- spatial simulations
- climate and weather model output
The {stars} package provides functions and methods for working with space-time data as multi-dimensional arrays (S3 classes).
To see what methods (functions) for class stars are available use the methods() function.
methods(class = "stars")## [1] [ [[<- [<- %in%
## [5] $<- adrop aggregate aperm
## [9] as.data.frame c coerce contour
## [13] cut dim dimnames dimnames<-
## [17] droplevels filter hist image
## [21] initialize is.na Math merge
## [25] Ops plot predict print
## [29] select show slotsFromS3 split
## [33] st_apply st_area st_as_sf st_as_sfc
## [37] st_as_stars st_bbox st_coordinates st_crop
## [41] st_crs st_crs<- st_dimensions st_dimensions<-
## [45] st_downsample st_extract st_geometry st_interpolate_aw
## [49] st_intersects st_join st_mosaic st_normalize
## [53] st_redimension st_sample st_set_bbox st_transform_proj
## [57] st_transform write_stars
## see '?methods' for accessing help and source code
The list includes {base} R and {tidyverse} methods.
The typical data array is one where two dimensions represent spatial raster dimensions and the third dimension is a band (or time). (Figure: data array)
But arrays can have more dimensions, for example time, space, spectral band, and sensor type. (Figure: data cube)
You import a set of rasters (raster stack) as a {stars} object using the stars::read_stars() function. Consider the multi-band image taken from a Landsat 7 view of a small part of the Brazilian coast. It is included in the {stars} package and stored as a GeoTIFF file labeled L7_ETMs.tif.
f <- system.file("tif/L7_ETMs.tif",
package = "stars")
L7.stars <- stars::read_stars(f)
L7.stars## stars object with 3 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## L7_ETMs.tif 1 54 69 68.91242 86 255
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [x]
## y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [y]
## band 1 6 NA NA NA NA NULL
dim(L7.stars)## x y band
## 349 352 6
There are three dimensions to this {stars} object: two spatial (x and y) and a third across six bands (band). Values across the six bands and space are summarized as a single attribute named L7_ETMs.tif.
The data are stored in a four dimensional array. The first index is the attribute, the second and third indexes are the spatial coordinates, and the fourth index is the band.
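The bracket subsetting in what follows mirrors base R array indexing. As a plain base R analogy (an ordinary array, not a stars object), consider a 3-d array with x, y, and band dimensions:

```r
# A base R analogy for the stars data cube: a 3-d array with
# x, y, and band dimensions. Subsetting on the band dimension
# works just like L7.stars[,,,3:4] on the stars object.
a <- array(runif(4 * 5 * 6), dim = c(4, 5, 6),
           dimnames = list(x = NULL, y = NULL, band = as.character(1:6)))
dim(a)            # 4 5 6

b34 <- a[, , 3:4]  # keep bands 3 and 4 only
dim(b34)          # 4 5 2
```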
Here you plot bands 3 and 4 by sequencing on the fourth index and using the plot() method.
plot(L7.stars[,,,3:4])
Since the data object is S3, you use functions from the {ggplot2} package together with the geom_stars() layer from the {stars} package to plot all six bands with a common color scale bar.
library(ggplot2)
ggplot() +
stars::geom_stars(data = L7.stars) +
facet_wrap(~ band)
You create a new {stars} object by applying a function to the band values. For example, here you compute the normalized difference vegetation index (NDVI) by applying a function across the x and y spatial dimensions using the stars::st_apply() method after creating the function NDVI().
NDVI <- function(z) (z[4] - z[3]) / (z[4] + z[3])
( NDVI.stars <- stars::st_apply(L7.stars,
MARGIN = c("x", "y"),
FUN = NDVI) )## stars object with 2 dimensions and 1 attribute
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## NDVI -0.7534247 -0.2030075 -0.06870229 -0.06432464 0.1866667 0.5866667
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [x]
## y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [y]
ggplot() +
stars::geom_stars(data = NDVI.stars) 
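stars::st_apply() with MARGIN = c("x", "y") works like base R apply(): for each (x, y) location the function sees the vector of values across the remaining (band) dimension. A base R analogy on a plain array (hypothetical values, not the Landsat data):

```r
# Base R analogy for stars::st_apply(): apply a function over the
# spatial margins of an array, collapsing the band dimension.
NDVI <- function(z) (z[4] - z[3]) / (z[4] + z[3])

a <- array(runif(4 * 5 * 6, min = 1, max = 255), dim = c(4, 5, 6))
ndvi <- apply(a, MARGIN = c(1, 2), FUN = NDVI)
dim(ndvi)    # 4 5: one NDVI value per (x, y) location
range(ndvi)  # always within (-1, 1) for positive band values
```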
The stars data frame can also be split, here on the band dimension, to yield a representation with six rasters as separate attributes.
( L7split.stars <- split(L7.stars,
f = "band") )## stars object with 2 dimensions and 6 attributes
## attribute(s):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## X1 47 67 78 79.14772 89 255
## X2 32 55 66 67.57465 79 255
## X3 21 49 63 64.35886 77 255
## X4 9 52 63 59.23541 75 255
## X5 1 63 89 83.18266 112 255
## X6 1 32 60 59.97521 88 255
## dimension(s):
## from to offset delta refsys point values x/y
## x 1 349 288776 28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [x]
## y 1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE NULL [y]
Now the bands are given as columns in the data frame part of the {stars} object and there are only two dimensions (x and y).
Monthly precipitation across the globe
Here you import a NetCDF (Network Common Data Form) file as a space-time raster. NetCDF is a set of formats that support scientific data as arrays. Here the data are monthly global precipitation anomalies on a 2.5 by 2.5 degree lat/lon grid. You read the NetCDF file as an array with three dimensions: two of planar space and a third of time (monthly, starting in 1948).
if(!"precip.mon.anom.nc" %in% list.files(here::here("data"))) {
download.file(url = "http://myweb.fsu.edu/jelsner/temp/data/precip.mon.anom.nc",
destfile = here::here("data", "precip.mon.anom.nc"))
}
( w.stars <- stars::read_stars(here::here("data", "precip.mon.anom.nc")) )
There are two spatial dimensions and the third dimension is time in months. There is one attribute, the rain rate in millimeters per day (mm/d).
Here you plot the first month of the global precipitation anomalies.
plot(w.stars[,,,1])
Raster data do not need to be regular or aligned along the cardinal directions. Functions in the {stars} package support rotated, sheared, rectilinear, and curvilinear grids. (Figure: grids)
Functions in the {stars} package also support the vector data model. Vector data cubes arise when you have a single dimension that points to distinct spatial feature geometries, such as polygons (e.g., denoting administrative regions). (Figure: vector data cube with polygons)
Or points (e.g., denoting sensor locations). (Figure: vector data cube with points)
For more see: https://github.com/r-spatial/stars/tree/master/vignettes and https://awesomeopensource.com/project/r-spatial/stars
Also you can check out some rough code that I’ve been working on to take advantage of the {stars} functionality, including plotting daily temperatures across the U.S. and creating a vector data cube of COVID19 data, in the stars.Rmd file on the course GitHub site in the folder Other_Rmds.
Mapping using functions from the {ggplot2} package
The {ggplot2} package supports sf objects for making maps through the geom_sf() function. An initial ggplot() call is followed by one or more layers that are added with the + symbol. The layer functions begin with geom_.
For example, consider the objects nz and nz_height from the {spData} package, where nz is a simple feature data frame from the New Zealand census with information about the area, population, and sex ratio (male/female) in the country’s 16 administrative regions.
str(spData::nz)## Classes 'sf' and 'data.frame': 16 obs. of 7 variables:
## $ Name : chr "Northland" "Auckland" "Waikato" "Bay of Plenty" ...
## $ Island : chr "North" "North" "North" "North" ...
## $ Land_area : num 12501 4942 23900 12071 8386 ...
## $ Population : num 175500 1657200 460100 299900 48500 ...
## $ Median_income: int 23400 29600 27900 26200 24400 26100 29100 25000 32700 26900 ...
## $ Sex_ratio : num 0.942 0.944 0.952 0.928 0.935 ...
## $ geom :sfc_MULTIPOLYGON of length 16; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:68, 1:2] 1745493 1740539 1733165 1720197 1709110 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geom"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA
## ..- attr(*, "names")= chr [1:6] "Name" "Island" "Land_area" "Population" ...
The simple feature column (sfc) is labeled geom and the geometry type is multi-polygon.
And spData::nz_height is a simple feature data frame containing the elevation of specific high points (peaks) in New Zealand.
str(spData::nz_height)## Classes 'sf' and 'data.frame': 101 obs. of 3 variables:
## $ t50_fid : int 2353944 2354404 2354405 2369113 2362630 2362814 2362817 2363991 2363993 2363994 ...
## $ elevation: int 2723 2820 2830 3033 2749 2822 2778 3004 3114 2882 ...
## $ geometry :sfc_POINT of length 101; first list element: 'XY' num 1204143 5049971
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
## ..- attr(*, "names")= chr [1:2] "t50_fid" "elevation"
The simple feature column is labeled geometry and the geometry type is point.
You make a choropleth map of the median income in the New Zealand regions and add a layer indicating the location of the elevation peaks.
ggplot() +
geom_sf(data = spData::nz,
mapping = aes(fill = Median_income)) +
geom_sf(data = spData::nz_height) +
scale_x_continuous(breaks = c(170, 175))
The first use of geom_sf() takes the geometry column of the simple feature data frame spData::nz for mapping the spatial aesthetic. The mapping = argument specifies other aesthetics with the aes() function. Here fill = points to the column Median_income in the simple feature data frame. The second use of geom_sf() takes the geometry column of spData::nz_height and adds the locations of the highest peaks as points.
The geom_sf() function automatically plots graticules (lines of latitude and longitude) with labels. The default ranges for the graticules can be overridden using scale_x_continuous(), scale_y_continuous() or coord_sf(datum = NA).
The advantages of using functions from {ggplot2} for mapping include a large community of users and many add-on packages.
Another example: county land area by state in the U.S. The data are in a simple feature data frame available in the {USAboundariesData} package from ropensci.org (not on CRAN).
install.packages("USAboundariesData",
repos = "http://packages.ropensci.org",
type = "source")Here you extract the county borders in Florida then make a choropleth of the land area.
FLcounties.sf <- USAboundaries::us_counties(states = "FL")
ggplot() +
geom_sf(data = FLcounties.sf,
mapping = aes(fill = aland))
Mapping using functions from the {tmap} package
There are several other packages for making quick, nice maps listed in the syllabus.
I particularly like the {tmap} package because it is agnostic to the type of spatial data object. Simple feature data frames as well as {sp} and {raster} objects can be combined on a single map. This is not the case with the {ggplot2} functions.
if(!require(tmap)) install.packages(pkgs = "tmap", repos = "http://cran.us.r-project.org")## Loading required package: tmap
Functions in the {tmap} package use the ‘grammar of graphics’ philosophy that separates the data frame from the aesthetics (how the data are made visible). Functions translate the data into aesthetics. The aesthetics can include the location on a geographic map (defined by the geometry), color, and other visual components.
A {tmap} map starts with the tm_shape() function that takes as input a spatial data frame. The function is followed by one or more layers such as tm_fill(), tm_dots(), tm_raster(), etc. that define how a property in the data gets translated to a visual component.
Returning to the New Zealand simple feature data frame (nz). To make a map of the region borders you first identify the spatial data frame with the tm_shape() function and then add a borders layer with the tm_borders() layer.
tmap::tm_shape(shp = spData::nz) +
tmap::tm_borders()
The function tmap::tm_shape() and its subsequent drawing layers (here tmap::tm_borders()) form a ‘group’. The data in the tmap::tm_shape() function must be a spatial object of class simple feature, raster, or an S4 spatial class.
Here you use a fill layer (tmap::tm_fill()) instead of the borders layer.
tmap::tm_shape(spData::nz) +
tmap::tm_fill()
The multi-polygons are filled using the same gray color as the borders so they disappear.
In this next example you layer using the fill aesthetic and then add a border aesthetic.
tmap::tm_shape(spData::nz) +
tmap::tm_fill(col = 'green') +
tmap::tm_borders()
Layers are added with the + operator and are functionally equivalent to adding a GIS layer.
You can assign the resulting map to an object. For example here you assign the map of New Zealand to the object map_nz.
map_nz <- tmap::tm_shape(spData::nz) +
tmap::tm_polygons()
class(map_nz)## [1] "tmap"
The resulting object is of class tmap.
New spatial data are added with + tm_shape(new_object). In this case new_object represents a new spatial data frame to be plotted over the preceding layers. When a new spatial data frame is added in this way, all subsequent aesthetic functions refer to it, until another spatial data frame is added.
For example, let’s add an elevation layer to the New Zealand map. The elevation raster (nz_elev) is in the {spDataLarge} package on GitHub.
The install_github() function from the {devtools} package is used to install packages on GitHub. GitHub is a company that provides hosting for software development version control using Git. Git is a version-control system for tracking changes in code during software development.
if(!require(devtools)) install.packages(pkgs = "devtools", repos = "http://cran.us.r-project.org")## Loading required package: devtools
## Loading required package: usethis
library(devtools)
if(!require(spDataLarge)) install_github(repo = "Nowosad/spDataLarge")## Loading required package: spDataLarge
library(spDataLarge)
Next identify the spatial data for the new layer by adding tm_shape(nz_elev). Then add the raster layer with the tm_raster() function and set the transparency level to 70% (alpha = .7).
( map_nz1 <- map_nz +
tmap::tm_shape(spDataLarge::nz_elev) +
tmap::tm_raster(alpha = .7) )## stars object downsampled to 877 by 1140 cells. See tm_shape manual (argument raster.downsample)
The new map object map_nz1 builds on top of the existing map object map_nz by adding the raster layer spDataLarge::nz_elev representing elevation.
You can create new layers with functions. For instance, a function like sf::st_union() operates on the geometry column of a simple feature data frame.
As an example, here you create a line string layer as a simple feature object using three geo-computation functions. You start by creating a union over all polygons (regions) with the sf::st_union() function applied to the spData::nz simple feature object. The result is a multi-polygon defining the coastlines.
Then you buffer this multi-polygon out to a distance of 22.2 km using the sf::st_buffer() function. The result is a single polygon defining the coastal boundary around the entire country.
Finally you change the polygon geometry to a line string geometry with the sf::st_cast() function.
The operations are linked together with the pipe operator.
( nz_water.sfc <- spData::nz |>
sf::st_union() |>
sf::st_buffer(dist = 22200) |>
sf::st_cast(to = "LINESTRING") )## Geometry set for 1 feature
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: 1067944 ymin: 4726340 xmax: 2111732 ymax: 6214066
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## LINESTRING (1074909 4920220, 1074855 4920397, 1...
Now add the resulting sfc as a layer to our map.
( map_nz2 <- map_nz1 +
tmap::tm_shape(nz_water.sfc) +
tmap::tm_lines() )## stars object downsampled to 877 by 1140 cells. See tm_shape manual (argument raster.downsample)
Finally, create a layer representing the country elevation high points (stored in the object spData::nz_height) onto the map_nz2 object with tmap::tm_dots() function.
( map_nz3 <- map_nz2 +
tmap::tm_shape(spData::nz_height) +
tmap::tm_dots() )## stars object downsampled to 877 by 1140 cells. See tm_shape manual (argument raster.downsample)
Map layout, facets, and inserts
Layout functions help create a cartographic map. Elements include the title, the scale bar, margins, aspect ratios, etc. For example, here elements such as a north arrow and a scale bar are added with tm_compass() and tm_scale_bar(), respectively and the tm_layout() function is used to add the title and background color.
map_nz +
tm_compass(type = "8star",
position = c("left", "top")) +
tm_scale_bar(breaks = c(0, 100, 200),
text.size = 1) +
tm_layout(title = "New Zealand",
bg.color = "lightblue")## Compass not supported in view mode.
## Warning: In view mode, scale bar breaks are ignored.
Putting two or more maps with the same scale side by side makes it easy to compare them and to see how spatial relationships change with respect to another variable. Creating small multiples of the same map with different variables is called ‘faceting’.
Consider the simple feature data frame World from the {tmap} package. Make the data frame accessible to this session with the data() function.
library(tmap)
data(World)
head(World)## Simple feature collection with 6 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825
## Geodetic CRS: WGS 84
## iso_a3 name sovereignt continent
## 1 AFG Afghanistan Afghanistan Asia
## 2 AGO Angola Angola Africa
## 3 ALB Albania Albania Europe
## 4 ARE United Arab Emirates United Arab Emirates Asia
## 5 ARG Argentina Argentina South America
## 6 ARM Armenia Armenia Asia
## area pop_est pop_est_dens economy
## 1 652860.00 [km^2] 28400000 43.50090 7. Least developed region
## 2 1246700.00 [km^2] 12799293 10.26654 7. Least developed region
## 3 27400.00 [km^2] 3639453 132.82675 6. Developing region
## 4 71252.17 [km^2] 4798491 67.34519 6. Developing region
## 5 2736690.00 [km^2] 40913584 14.95003 5. Emerging region: G20
## 6 28470.00 [km^2] 2967004 104.21510 6. Developing region
## income_grp gdp_cap_est life_exp well_being footprint inequality
## 1 5. Low income 784.1549 59.668 3.8 0.79 0.4265574
## 2 3. Upper middle income 8617.6635 NA NA NA NA
## 3 4. Lower middle income 5992.6588 77.347 5.5 2.21 0.1651337
## 4 2. High income: nonOECD 38407.9078 NA NA NA NA
## 5 3. Upper middle income 14027.1261 75.927 6.5 3.14 0.1642383
## 6 4. Lower middle income 6326.2469 74.446 4.3 2.23 0.2166481
## HPI geometry
## 1 20.22535 MULTIPOLYGON (((61.21082 35...
## 2 NA MULTIPOLYGON (((16.32653 -5...
## 3 36.76687 MULTIPOLYGON (((20.59025 41...
## 4 NA MULTIPOLYGON (((51.57952 24...
## 5 35.19024 MULTIPOLYGON (((-65.5 -55.2...
## 6 25.66642 MULTIPOLYGON (((43.58275 41...
The simple feature data frame has socio-economic indicators by country. Each row is a country.
Further, consider the simple feature data frame urban_agglomerations from the {spData} package. The data frame is from the United Nations population division with projections up to 2050 for the top 30 largest areas by population at 5 year intervals (in long form).
The geometries are points indicating the location of the largest urban metro areas.
You create a new data frame keeping only the years 1970, 1990, 2010, and 2030 by using the filter() function from the {dplyr} package.
urb_1970_2030 <- spData::urban_agglomerations |>
dplyr::filter(year %in% c(1970, 1990, 2010, 2030))
Note that the operator %in% acts like a vectorized “or”: it returns TRUE where year == 1970 or year == 1990, and so on. For example,
1969:2031 ## [1] 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983
## [16] 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
## [31] 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
## [46] 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028
## [61] 2029 2030 2031
1969:2031 %in% c(1970, 1990, 2010, 2030)## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE TRUE FALSE
Returns a series of TRUEs and FALSEs.
The first map layer is the country polygons from the World data frame and the second layer is city locations from the urb_1970_2030 data frame using the tmap::tm_symbols() function. The symbol size is scaled by the variable population_millions. Finally you group by the variable year with the tmap::tm_facets() function to produce a four-panel set of maps.
tmap::tm_shape(World) +
tmap::tm_polygons() +
tmap::tm_shape(urb_1970_2030) +
tmap::tm_symbols(col = "black",
border.col = "white",
size = "population_millions") +
tmap::tm_facets(by = "year",
nrow = 2,
free.coords = FALSE)## Legend for symbol sizes not available in view mode.
The above code chunk demonstrates key features of faceted maps created with functions from the {tmap} package.
- Shapes that do not have a facet variable are repeated (the countries in World in this case).
- The by = argument specifies the variable the facets vary over (year in this case).
- The nrow/ncol settings specify the number of rows (and columns) the facets are arranged into.
- The free.coords = argument specifies whether each map has its own bounding box.
Small multiples are also generated by assigning more than one value to one of the aesthetic arguments.
For example here you map the happiness index (HPI) on one map and gross domestic product per capita (gdp_cap_est) on another map. Both variables are in the World data frame.
tmap::tm_shape(World) +
tmap::tm_polygons(c("HPI", "gdp_cap_est"),
style = c("pretty", "kmeans"),
palette = list("RdYlGn", "Purples"),
title = c("Happy Planet Index", "GDP per capita"))
Note that the variable names must be in quotes (e.g., “HPI”).
The maps are identical except for the variable being plotted. All arguments of the layer functions can be vectorized, one for each map. Arguments that normally take a vector, such as palette =, are placed in a list().
Multiple map objects can also be arranged in a single plot with the tmap::tmap_arrange() function. Here you create two separate maps then arrange them.
map1 <- tmap::tm_shape(World) +
tmap::tm_polygons("HPI",
style = "pretty",
palette = "RdYlGn",
title = "Happy Planet Index")
map2 <- tmap::tm_shape(World) +
tmap::tm_polygons("gdp_cap_est",
style = "kmeans",
palette = "Purples",
title = "GDP per capita")
tmap_arrange(map1, map2)
Example: COVID-19 vaccinations by state on Saturday February 6, 2021. Get the data.
f <- "https://raw.githubusercontent.com/owid/covid-19-data/e2da3a49250481a8a22f993ee5c3731111ba6958/scripts/scripts/vaccinations/us_states/input/cdc_data_2021-02-06.csv"
df <- readr::read_csv(f)## Rows: 65 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Location, ShortName, LongName
## dbl (14): Census2019, Doses_Distributed, Doses_Administered, Dist_Per_100K,...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Get a US census map from the {USAboundaries} package. Rename the state name column (name) to LongName.
sf <- USAboundaries::us_states() |>
dplyr::filter(!name %in% c("District of Columbia", "Puerto Rico", "Hawaii", "Alaska")) |>
dplyr::rename(LongName = name)
Join the COVID data frame with the simple feature data frame from the census. Then make a map showing the doses administered per 100K people.
sf <- sf |>
dplyr::left_join(df,
by = "LongName")
tmap::tm_shape(sf) +
tmap::tm_fill(col = "Admin_Per_100K", title = "Per 100K" ) +
tmap::tm_borders(col = "gray70") +
tmap::tm_layout(legend.outside = TRUE)
Creating an interactive map
A nice feature of the {tmap} package is that you can create an interactive map using the same code used to create a static map.
For example, with the mode set to "view" in the tmap::tmap_mode() function the county boundary map created from the FLcounties.sf simple feature data frame using the {tmap} functions is interactive.
tmap::tmap_mode("view")## tmap mode set to interactive viewing
tmap::tm_shape(FLcounties.sf) +
tmap::tm_borders()
Click on the layer symbol and change to OpenStreetMap.
With the interactive mode turned on, all maps produced with {tmap} launch as zoom-able HTML. This feature includes the ability to specify the base map with tm_basemap() (or tmap_options()) as demonstrated here.
map_nz +
tmap::tm_basemap(server = "OpenTopoMap")
You can also create interactive maps with the tmap_leaflet() function.
The view mode in {tmap} works with faceted plots. The argument sync in tm_facets() is used to produce multiple maps with synchronized zoom and pan settings.
world_coffee <- dplyr::left_join(spData::world,
spData::coffee_data,
by = "name_long")
tmap::tm_shape(world_coffee) +
tmap::tm_polygons(c("coffee_production_2016",
"coffee_production_2017")) +
tmap::tm_facets(nrow = 1, sync = TRUE)
Change the view mode back to plot.
tmap_mode("plot")## tmap mode set to plotting
Adding an inset map
An inset map puts the geographic study area into context. Here you create a map of the central part of New Zealand’s Southern Alps. The inset map shows where the main map is in relation to the rest of New Zealand.
The first step is to define the area of interest. Here it is done by creating a new spatial object nz_region using the sf::st_bbox() function and sf::st_as_sfc() to make it a simple feature column.
nz_region <- sf::st_bbox(c(xmin = 1340000, xmax = 1450000,
ymin = 5130000, ymax = 5210000),
crs = sf::st_crs(spData::nz_height)) |>
sf::st_as_sfc()
Next create a base map showing New Zealand’s Southern Alps area. This closeup view carries the map’s main message. The region is clipped to the simple feature column nz_region created above. The layers include a raster of elevations and the locations of high points. A scale bar is included.
( nz_height_map <- tmap::tm_shape(nz_elev,
bbox = nz_region) +
tmap::tm_raster(style = "cont",
palette = "YlGn",
legend.show = TRUE) +
tmap::tm_shape(spData::nz_height) +
tmap::tm_symbols(shape = 2,
col = "red",
size = 1) +
tmap::tm_scale_bar(position = c("left", "bottom")) )## stars object downsampled to 877 by 1140 cells. See tm_shape manual (argument raster.downsample)
Next create the inset map. It gives context and helps locate the area of interest by clearly indicating the position of the main map within the rest of New Zealand.
( nz_map <- tmap::tm_shape(spData::nz) +
tmap::tm_polygons() +
tmap::tm_shape(spData::nz_height) +
tmap::tm_symbols(shape = 2,
col = "red",
size = .1) +
tmap::tm_shape(nz_region) +
tmap::tm_borders(lwd = 3) )
Finally combine the two maps. The viewport() function from the {grid} package is used to give a center location (x and y) and the size (width and height) of the inset map.
library(grid)
nz_height_map## stars object downsampled to 877 by 1140 cells. See tm_shape manual (argument raster.downsample)
print(nz_map,
vp = viewport(.8, .27, width = .5, height = .5))
Additional details and examples on making maps in R are available in the book “Geocomputation with R” by Lovelace, Nowosad, and Muenchow https://geocompr.robinlovelace.net/adv-map.html
Mapping walking (etc) distances. https://walker-data.com/mapboxapi/
Tuesday September 27, 2022
“You haven’t mastered a tool until you understand when it should not be used.” – Kelsey Hightower
Today
- Defining spatial neighborhoods and spatial weights
- Computing spatial autocorrelation
- Spatial lag and its relation to autocorrelation
Defining spatial neighborhoods and spatial weights
Autocorrelation plays a central role in spatial statistics. It measures the degree to which things tend to cluster. Things include attribute values aggregated to polygons (or raster cells) as well as locations. How autocorrelation gets estimated depends on the geometry of the spatial data.
Things tend to cluster because of:
Association: whatever causes an attribute to have a certain value in one area causes the same attribute to have a similar value in areas nearby. Crime rates in nearby neighborhoods might tend to cluster due to similar factors.
Causality: something within a given area directly influences outcomes within nearby areas. Non-infectious diseases (e.g., lung cancer) have similar rates in neighborhoods close to an oil refinery.
Interaction: the movement of people, goods or information creates relationships between areas. COVID spreads through areas through the movement of people.
Spatial statistics quantify, and condition on, autocorrelation but they are silent about physical causes. Understanding the reason for autocorrelation in your data is important for inference because an apparent association might be confounded by another spatially varying factor. The divorce rate is high in southern states, but so is the number of Waffle Houses. Understanding causation requires domain-specific knowledge.
When a variable’s values are aggregated (summed or averaged) to regions, autocorrelation is quantified by calculating how similar a value in region \(i\) is to the value in region \(j\) and weighting this similarity by how ‘close’ region \(i\) is to region \(j\). Closer regions are given greater weight.
High similarities with high weight (similar values close together) yield high values of spatial autocorrelation. Low similarities with high weight (dissimilar values close together) yield low values of spatial autocorrelation. Let \(\hbox{sim}_{ij}\) denote the similarity between values \(Y_i\) and \(Y_j\), and let \(w_{ij}\) denote a set of weights describing the ‘distance’ between regions \(i\) and \(j\), for \(i\), \(j\) = 1, …, \(N\).
A general spatial autocorrelation index (SAI) is given by \[ \hbox{SAI} = \frac{\sum_{i,j=1}^N w_{ij}\hbox{sim}_{ij}}{\sum_{i,j=1}^N w_{ij}} \] which represents the weighted similarity between regions. The set of weights (\(w_{ij}\)) is called a spatial weights matrix. The spatial weights matrix defines the neighbors for each region and defines the strength of each association.
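The SAI formula can be checked by hand on a toy example. Here is a minimal sketch with four made-up regional values, binary contiguity weights, and the product of mean deviations as the similarity measure (the choice Moran's I makes, introduced later):

```r
# Four regions in a row: 1-2, 2-3, and 3-4 are neighbors (binary weights)
Y <- c(2, 3, 10, 11)                    # made-up attribute values
w <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
sim <- outer(Y - mean(Y), Y - mean(Y))  # similarity between regions i and j
SAI <- sum(w * sim) / sum(w)            # weighted similarity
SAI
```

The value is positive because similar values (low next to low, high next to high) sit on adjacent regions for two of the three neighbor pairs.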
For cells in a raster under the rook-contiguity criterion, \(w_{ij}\) = 1 if cell \(i\) and \(j\) share a boundary, and 0 if they don’t share a boundary. In this case \(w_{ij}\) = \(w_{ji}\). Also, a cell is not a neighbor of itself so \(w_{ii}\) = 0.
Alternatively you can define center locations from a set of polygon regions and let \(w_{ij}\) = 1 if the center of region \(i\) is near the center of region \(j\) and 0 otherwise. Here you need to decide on the number of nearest neighbors.
You can also define neighbors by distance. For example, if \(d_{ij}\) is the distance between centers \(i\) and \(j\), you can let \(w_{ij}\) = 1 if \(d_{ij}\) < \(\delta\) and 0 otherwise.
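Each of these neighbor definitions has a constructor in the {spdep} package. A sketch, assuming {spdep} is installed (the point coordinates below are made up):

```r
library(spdep)

# Rook contiguity on a 5 x 5 grid of raster-like cells
grid_nb <- cell2nb(nrow = 5, ncol = 5, type = "rook")

# Neighbors defined from point (center) locations
set.seed(1)
xy <- cbind(runif(10), runif(10))             # made-up center coordinates
knn_nb  <- knn2nb(knearneigh(xy, k = 3))      # 3 nearest neighbors
dist_nb <- dnearneigh(xy, d1 = 0, d2 = .35)   # centers within 0.35 units
```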
Consider crime data at the tract level in the city of Columbus, Ohio. The tract polygons are projected with arbitrary spatial coordinates.
if(!"columbus" %in% list.files("data")) {
download.file(url = "http://myweb.fsu.edu/jelsner/temp/data/columbus.zip",
destfile = here::here("data", "columbus.zip"))
unzip(here::here("data", "columbus.zip"),
exdir = here::here("data"))
}
( CC.sf <- sf::st_read(dsn = here::here("data", "columbus"),
layer = "columbus") )## Reading layer `columbus' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/columbus'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## First 10 features:
## AREA PERIMETER COLUMBUS_ COLUMBUS_I POLYID NEIG HOVAL INC CRIME
## 1 0.309441 2.440629 2 5 1 5 80.467 19.531 15.725980
## 2 0.259329 2.236939 3 1 2 1 44.567 21.232 18.801754
## 3 0.192468 2.187547 4 6 3 6 26.350 15.956 30.626781
## 4 0.083841 1.427635 5 2 4 2 33.200 4.477 32.387760
## 5 0.488888 2.997133 6 7 5 7 23.225 11.252 50.731510
## 6 0.283079 2.335634 7 8 6 8 28.750 16.029 26.066658
## 7 0.257084 2.554577 8 4 7 4 75.000 8.438 0.178269
## 8 0.204954 2.139524 9 3 8 3 37.125 11.337 38.425858
## 9 0.500755 3.169707 10 18 9 18 52.600 17.586 30.515917
## 10 0.246689 2.087235 11 10 10 10 96.400 13.598 34.000835
## OPEN PLUMB DISCBD X Y NSA NSB EW CP THOUS NEIGNO
## 1 2.850747 0.217155 5.03 38.80 44.07 1 1 1 0 1000 1005
## 2 5.296720 0.320581 4.27 35.62 42.38 1 1 0 0 1000 1001
## 3 4.534649 0.374404 3.89 39.82 41.18 1 1 1 0 1000 1006
## 4 0.394427 1.186944 3.70 36.50 40.52 1 1 0 0 1000 1002
## 5 0.405664 0.624596 2.83 40.01 38.00 1 1 1 0 1000 1007
## 6 0.563075 0.254130 3.78 43.75 39.28 1 1 1 0 1000 1008
## 7 0.000000 2.402402 2.74 33.36 38.41 1 1 0 0 1000 1004
## 8 3.483478 2.739726 2.89 36.71 38.71 1 1 0 0 1000 1003
## 9 0.527488 0.890736 3.17 43.44 35.92 1 1 1 0 1000 1018
## 10 1.548348 0.557724 4.33 47.61 36.42 1 1 1 0 1000 1010
## geometry
## 1 POLYGON ((8.624129 14.23698...
## 2 POLYGON ((8.25279 14.23694,...
## 3 POLYGON ((8.653305 14.00809...
## 4 POLYGON ((8.459499 13.82035...
## 5 POLYGON ((8.685274 13.63952...
## 6 POLYGON ((9.401384 13.5504,...
## 7 POLYGON ((8.037741 13.60752...
## 8 POLYGON ((8.247527 13.58651...
## 9 POLYGON ((9.333297 13.27242...
## 10 POLYGON ((10.08251 13.03377...
The simple feature data frame contains housing values (HOVAL), income values (INC), and crime rates (CRIME) for census tracts across the city. Crime (CRIME) is residential burglaries and vehicle thefts per 1000 households. Income (INC) and housing values (HOVAL) are annual values in units of 1000 dollars.
Create a choropleth map of the crime rates (CRIME).
tmap::tm_shape(CC.sf) +
tmap::tm_fill(col = "CRIME",
title = "Burglary & Vehicle Thefts\n/1000 Households")## Warning: Currect projection of shape CC.sf unknown. Long-lat (WGS84) is assumed.
Note that the variable name CRIME must be in quotes.
Alternatively we create a choropleth map of the crime rates using geom_sf(). Here the variable name CRIME is without quotes.
library(ggplot2)
ggplot(data = CC.sf) +
geom_sf(mapping = aes(fill = CRIME)) +
labs(fill = "Burglary & Vehicle Thefts\n/1000 Households") +
theme_void()
High crime areas tend to be clustered.
Autocorrelation quantifies the amount of clustering. To compute the autocorrelation you first need to define the neighbors for each polygon.
You create a list of neighbors using the spdep::poly2nb() function from the {spdep} package. The ‘nb’ in the function name stands for neighbor list object. The function builds the list from the geometries based on contiguity: by default, two polygons are neighbors if they share at least one boundary point (queen contiguity). Requiring a shared boundary segment instead (rook contiguity) is done with the argument queen = FALSE. Functions in the {spdep} package support S3 and S4 spatial data objects.
if(!require(spdep)) install.packages("spdep", repos = "http://cran.us.r-project.org")## Loading required package: spdep
( nbs <- spdep::poly2nb(CC.sf) )## Neighbour list object:
## Number of regions: 49
## Number of nonzero links: 236
## Percentage nonzero weights: 9.829238
## Average number of links: 4.816327
Note that this only works for spatial data frames.
The output tells you there are 49 tracts (polygons). Each tract is bordered by at least one other tract. The average number of neighbors is 4.8 and the total number of links over all tracts is 236. This represents 9.8% of the 49 × 49 = 2401 possible ordered pairs of tracts (counting each tract paired with itself and with every other tract).
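These summary numbers can be reproduced with simple arithmetic:

```r
236 / (49 * 49) * 100  # percentage nonzero weights
## [1] 9.829238
236 / 49               # average number of links per tract
## [1] 4.816327
```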
A graph of the neighbor links is obtained with the plot() method. The arguments include the neighbor list object (nbs) and the locations of the polygon centers, which are extracted from the simple feature data frame with the sf::st_centroid() function.
plot(CC.sf$geometry)
plot(nbs,
sf::st_centroid(CC.sf$geometry),
add = TRUE)
The graph is a network showing the contiguity pattern (adjacency neighbor structure). Tracts close to the center of the city have more neighboring tracts and thus more links in the network.
The number of links per tract (node), the link distribution, is obtained with the summary() method.
summary(nbs)## Neighbour list object:
## Number of regions: 49
## Number of nonzero links: 236
## Percentage nonzero weights: 9.829238
## Average number of links: 4.816327
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10
## 5 9 12 5 9 3 4 1 1
## 5 least connected regions:
## 1 6 42 46 47 with 2 links
## 1 most connected region:
## 20 with 10 links
The list of neighboring tracts for the first two tracts.
nbs[[1]]## [1] 2 3
nbs[[2]]## [1] 1 3 4
The first tract has two neighbors: tracts 2 and 3. The neighbor numbers are stored as an integer vector within the nb object. Tract 2 has three neighbors: tracts 1, 3, and 4. The function spdep::card() tallies the number of neighbors by tract.
spdep::card(nbs)## [1] 2 3 4 4 8 2 4 6 8 4 5 6 4 6 6 8 3 4 3 10 3 6 3 7 8
## [26] 6 4 9 7 5 3 4 4 4 7 5 6 6 3 5 3 2 6 5 4 2 2 4 3
Tract 5 has 8 neighbors and so on.
The next step is to attach weights to the neighbor list object to indicate how close each neighbor is. The function spdep::nb2listw() turns the neighbor list object into a spatial weights object. By default the weighting scheme gives each link the same weight, equal to the multiplicative inverse of the number of neighbors.
wts <- nbs |>
spdep::nb2listw()
class(wts)## [1] "listw" "nb"
The wts object inherits from the classes listw and nb. It contains both the weights (element weights) and the neighbor list (element neighbours).
summary(wts)## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 49
## Number of nonzero links: 236
## Percentage nonzero weights: 9.829238
## Average number of links: 4.816327
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10
## 5 9 12 5 9 3 4 1 1
## 5 least connected regions:
## 1 6 42 46 47 with 2 links
## 1 most connected region:
## 20 with 10 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 49 2401 49 22.75119 203.7091
The network statistics are given along with information about the weights. The default weighting scheme assigns a weight to each neighbor equal to the inverse of the number of neighbors (style = "W"). For a tract with 5 neighbors each neighbor gets a weight of 1/5. The sum over all weights (S0) is the number of tracts.
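You can verify this: under style = "W" each region's weights sum to 1, so the total weight S0 equals the number of regions.

```r
# Row-standardized weights sum to 1 per region, so S0 = number of regions
sum(unlist(wts$weights))  # 49 (up to floating point)
spdep::Szero(wts)         # the same sum computed by {spdep}
```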
To see the weights for the first two tracts type
wts$weights[1:2]## [[1]]
## [1] 0.5 0.5
##
## [[2]]
## [1] 0.3333333 0.3333333 0.3333333
The object weights represents the weights matrix as a list. The full matrix has dimensions 49 x 49 but most of the entries are zero.
nbs |>
spdep::nb2mat() |>
head()## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## 1 0.0000000 0.50 0.5000000 0.0000000 0.00 0.000 0 0.000 0.000 0 0.000
## 2 0.3333333 0.00 0.3333333 0.3333333 0.00 0.000 0 0.000 0.000 0 0.000
## 3 0.2500000 0.25 0.0000000 0.2500000 0.25 0.000 0 0.000 0.000 0 0.000
## 4 0.0000000 0.25 0.2500000 0.0000000 0.25 0.000 0 0.250 0.000 0 0.000
## 5 0.0000000 0.00 0.1250000 0.1250000 0.00 0.125 0 0.125 0.125 0 0.125
## 6 0.0000000 0.00 0.0000000 0.0000000 0.50 0.000 0 0.000 0.500 0 0.000
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## 1 0 0 0 0.000 0.000 0 0 0 0 0 0 0 0
## 2 0 0 0 0.000 0.000 0 0 0 0 0 0 0 0
## 3 0 0 0 0.000 0.000 0 0 0 0 0 0 0 0
## 4 0 0 0 0.000 0.000 0 0 0 0 0 0 0 0
## 5 0 0 0 0.125 0.125 0 0 0 0 0 0 0 0
## 6 0 0 0 0.000 0.000 0 0 0 0 0 0 0 0
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37]
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49]
## 1 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0
To see the neighbors of the first two tracts type
wts$neighbours[1:2]## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1 3 4
Tract 1 has two neighbors (tracts 2 & 3) so each is given a weight of 1/2. Tract 2 has three neighbors (tracts 1, 3, & 4) so each is given a weight of 1/3.
With the weights matrix saved as an object you are ready to compute a metric of spatial autocorrelation.
Caution: Neighbors defined by contiguity can leave some areas without any neighbors (islands, for example). By default the spdep::nb2listw() function assumes each area has at least one neighbor. If this is not the case you need to specify how areas without neighbors are handled using the argument zero.policy = TRUE. This permits the weights list to be formed with zero-length weights vectors.
For example, consider the districts in the country of Scotland.
if(!"scotlip" %in% list.files(here::here("data"))) {
download.file("http://myweb.fsu.edu/jelsner/temp/data/scotlip.zip",
destfile = here::here("data", "scotlip.zip"))
unzip(here::here("data", "scotlip.zip"),
exdir = here::here("data"))
}
SL.sf <- sf::st_read(dsn = here::here("data", "scotlip"),
layer = "scotlip")## Reading layer `scotlip' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/scotlip' using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 11 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 95631 ymin: 530297 xmax: 454570 ymax: 1203008
## CRS: NA
plot(SL.sf$geometry)
Three of the districts are islands. These districts have no bordering districts.
Create a list of neighbors.
( nbs2 <- SL.sf |>
spdep::poly2nb() )## Neighbour list object:
## Number of regions: 56
## Number of nonzero links: 234
## Percentage nonzero weights: 7.461735
## Average number of links: 4.178571
## 3 regions with no links:
## 6 8 11
Three regions with no links.
Use the spdep::nb2listw() function with the argument zero.policy = TRUE. Otherwise you get an error saying empty neighbour sets were found.
wts2 <- nbs2 |>
spdep::nb2listw(zero.policy = TRUE)
head(wts2$weights)## [[1]]
## [1] 0.3333333 0.3333333 0.3333333
##
## [[2]]
## [1] 0.5 0.5
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 0.3333333 0.3333333 0.3333333
##
## [[5]]
## [1] 0.3333333 0.3333333 0.3333333
##
## [[6]]
## NULL
Computing autocorrelation
A common autocorrelation statistic is Moran’s I. Moran’s I follows the basic form of autocorrelation indexes where the similarity between regions \(i\) and \(j\) is proportional to the product of the deviations from the mean \[ \hbox{sim}_{ij} \propto (Y_i - \bar Y) (Y_j - \bar Y) \] where \(i\) indexes the region and \(j\) indexes the neighbors of \(i\). The value of \(\hbox{sim}_{ij}\) is large when the \(Y\) values in the product are on the same side of their respective means (both above or below) and small when they are on opposite sides of their respective means (one above and one below or vice versa).
The formula for I is \[ \hbox{I} = \frac{N} {W} \frac {\sum_{i,j} w_{ij}(Y_i-\bar Y) (Y_j-\bar Y)} {\sum_{i} (Y_i-\bar Y)^2} \] where \(N\) is the number of regions, \(w_{ij}\) is the matrix of weights, and \(W\) is the sum over all weights.
Consider the following grid of cells containing attribute values.
if(!require(spatstat)) install.packages(pkgs = "spatstat", repos = "http://cran.us.r-project.org")## Loading required package: spatstat
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 2.4-0
##
## Attaching package: 'spatstat.geom'
## The following object is masked from 'package:grid':
##
## as.mask
## The following objects are masked from 'package:raster':
##
## area, rotate, shift
## The following object is masked from 'package:patchwork':
##
## area
## Loading required package: spatstat.random
## spatstat.random 2.2-0
## Loading required package: spatstat.core
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:raster':
##
## getData
## Loading required package: rpart
## spatstat.core 2.4-4
## Loading required package: spatstat.linnet
## spatstat.linnet 2.3-2
##
## spatstat 2.3-4 (nickname: 'Watch this space')
## For an introduction to spatstat, type 'beginner'
suppressMessages(library(spatstat))
set.seed(6750)
Y <- ppp(runif(200, 0, 1),
runif(200, 0, 1))
plot(quadratcount(Y), main = "")
The formula results in a single value of I representing the magnitude of the autocorrelation (amount of clustering) over the entire area.
First consider a single cell (\(N\) = 1). Start with the middle cell (row 3, column 3). Let \(i\) refer to this cell and let \(j\) index the eight cells touching it in reading order, starting with cell (2, 2), then cell (2, 3), etc.
Assume each neighbor is given a weight of 1/8 so \(W = \sum_{j=1}^8 w_j = 1\). Then the value of I for the single center cell is \[ \hbox{I}_{3,3} = \frac{(6 - \bar y)\left[(8 - \bar y) + (3 - \bar y) + (9 - \bar y) + (12 - \bar y) + (10 - \bar y) + (10 - \bar y) + (10 - \bar y) + (9 - \bar y)\right]}{(6 - \bar y)^2} \] where \(\bar y\) is the mean over all 25 cells.
y <- c(3, 10, 7, 12, 5, 11, 8, 3, 9, 12,
6, 12, 6, 10, 3, 8, 10, 10, 9, 7,
5, 10, 8, 5, 11)
( yb <- mean(y) )## [1] 8
Inum_i <- (6 - yb) *
((8 - yb) + (3 - yb) + (9 - yb) +
(12 - yb) + (10 - yb) + (10 - yb) +
(10 - yb) + (9 - yb))
Iden_i <- (6 - yb)^2
Inum_i/Iden_i## [1] -3.5
The I value of -3.5 indicates that the center cell, which has a value below the average over all 25 cells, is mostly surrounded by cells having values above the average.
Repeat this calculation for every cell and then take the sum.
This is what the function spdep::moran() from the {spdep} package does. The first argument is the vector containing the values for which you are interested in determining the magnitude of the spatial autocorrelation and the second argument is the listw object.
Further, you need to specify the number of regions and the sum of the weights S0. The latter is obtained from the spdep::Szero() function applied to the listw object.
Returning to the Columbus crime data here let m be the number of census tracts and s be the sum of the weights. You then apply the spdep::moran() function on the variable CRIME.
m <- length(CC.sf$CRIME)
s <- spdep::Szero(wts)
spdep::moran(CC.sf$CRIME,
listw = wts,
n = m,
S0 = s)## $I
## [1] 0.5001886
##
## $K
## [1] 2.225946
The function returns the Moran’s I statistic and the kurtosis (K) of the distribution of crime values. Moran’s I ranges from -1 to +1.
The value of .5 for the crime rates indicates a high level of spatial autocorrelation. This is expected based on the clustering of crime in the central city.
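As a check on your understanding, the value of I can be reproduced by evaluating the formula directly with matrix algebra. A sketch using the wts object created earlier:

```r
Wm <- spdep::listw2mat(wts)            # 49 x 49 spatial weights matrix
z  <- CC.sf$CRIME - mean(CC.sf$CRIME)  # deviations from the mean
N  <- length(z)
W  <- sum(Wm)                          # sum over all weights
(N / W) * sum(Wm * outer(z, z)) / sum(z^2)  # should match spdep::moran()
```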
Positive values of Moran’s I indicate clustering and negative values indicate inhibition. Inhibition is a process in which nearby locations tend to have attribute values on opposite sides of the mean (like a checkerboard pattern).
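A checkerboard pattern is the extreme case of inhibition. As a sketch (the grid and values are made up), Moran's I for a 4 x 4 checkerboard with rook-contiguity, row-standardized weights is -1 (up to floating point), because every neighbor pair disagrees:

```r
library(spdep)

cb <- outer(1:4, 1:4, function(r, c) (r + c) %% 2)  # 0/1 checkerboard values
nb <- cell2nb(nrow = 4, ncol = 4, type = "rook")
lw <- nb2listw(nb)
moran(as.vector(cb), listw = lw, n = 16, S0 = Szero(lw))$I  # close to -1
```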
Kurtosis is a statistic that indicates how peaked the distribution of the attribute values is. A normal distribution has a kurtosis of 3. If the kurtosis is too large (or small) relative to a normal distribution then any statistical inference we make with Moran’s I will be suspect.
Another statistic that indicates the amount of spatial autocorrelation is Geary’s C. The equation is \[ \hbox{C} = \frac{(N-1) \sum_{i,j} w_{ij} (Y_i-Y_j)^2}{2 W \sum_{i}(Y_i-\bar Y)^2} \] where \(W\) is the sum over all weights (\(w_{ij}\)) and \(N\) is the number of areas.
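The formula for C can be checked with a direct base R computation. Below is a sketch using a small made-up vector and a binary chain weights matrix (illustrative assumptions, not the Columbus data).

```r
# Geary's C computed by hand (toy values and chain weights)
y <- c(2, 4, 3, 8, 6)
N <- length(y)
W <- matrix(0, N, N)
for (i in 1:(N - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }

Wsum <- sum(W)                                   # W in the formula
num  <- sum(W * outer(y, y, function(a, b) (a - b)^2))
den  <- sum((y - mean(y))^2)

( C <- (N - 1) * num / (2 * Wsum * den) )
```

Because C is built from squared differences between neighboring values rather than cross-products of deviations, small values mean neighbors are similar (positive autocorrelation).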
The syntax of the spdep::geary() function is similar to that of spdep::moran() except you also specify n1 to be the number of areas minus one.
spdep::geary(CC.sf$CRIME,
listw = wts,
n = m,
S0 = s,
n1 = m - 1)## $C
## [1] 0.5405282
##
## $K
## [1] 2.225946
Values for Geary’s C range from 0 to 2 with 1 indicating no autocorrelation. Values less than 1 indicate positive autocorrelation. Both I and C are global measures of autocorrelation, but C is more sensitive to local variations in autocorrelation.
Rule of thumb: If the interpretation of Geary’s C is much different than the interpretation of Moran’s I then consider computing local measures of autocorrelation.
Spatial lag and its relation to autocorrelation
The interpretation of Moran’s I is simplified by the fact that the value of Moran’s I is the slope coefficient from a regression of the weighted average of the neighborhood values onto the observed values.
The weighted average of neighborhood values is called the spatial lag.
Let crime be the set of crime values in each region as a data vector. You create a spatial lag variable using the spdep::lag.listw() function. The first argument is the listw object and the second is the vector of crime values.
crime <- CC.sf$CRIME
Wcrime <- spdep::lag.listw(wts,
crime)
For each value in the vector crime there is a corresponding value in the vector Wcrime representing the average crime over the neighboring regions.
Recall tract 1 had tract 2 and 3 as its only neighbors. So the following should return a TRUE.
Wcrime[1] == (crime[2] + crime[3])/2## [1] TRUE
A scatter plot of the neighborhood average crime rate versus the crime rate in each tract shows there is a relationship.
data.frame(crime, Wcrime) |>
ggplot(mapping = aes(x = crime, y = Wcrime)) +
geom_point() +
geom_smooth(method = lm) +
scale_x_continuous(limits = c(0, 70)) +
scale_y_continuous(limits = c(0, 70)) +
xlab("Crime") +
ylab("Average Crime in the Neighborhood") +
theme_minimal()## `geom_smooth()` using formula 'y ~ x'

The vertical axis contains the neighborhood average crime rate. The range of neighborhood averages is smaller than the range of individual polygon crime rates.
Tracts with low values of crime tend to be surrounded by tracts with low values of crime on average and tracts with high values of crime tend be surrounded by tracts with high values of crime. The slope is upward (positive).
The magnitude of the slope is the Moran’s I value. To check this use the lm() function from base R, which fits linear regression models.
lm(Wcrime ~ crime)##
## Call:
## lm(formula = Wcrime ~ crime)
##
## Coefficients:
## (Intercept) crime
## 17.4797 0.5002
The coefficient on the crime variable in the linear regression is .5.
The scatter plot is called a ‘Moran’s scatter plot.’
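That the regression slope recovers Moran’s I can be verified outside {spdep}. A minimal sketch with made-up values and a row-standardized chain weights matrix (both assumptions for illustration):

```r
# With row-standardized weights, the slope of lag ~ y equals Moran's I
y <- c(2, 4, 3, 8, 6)
n <- length(y)
W <- matrix(0, n, n)
for (i in 1:(n - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
Ws <- W / rowSums(W)                  # row standardize: each row sums to one

lag <- as.vector(Ws %*% y)            # neighborhood averages (spatial lag)

d <- y - mean(y)
I <- sum(d * (lag - mean(y))) / sum(d^2)   # Moran's I (S0 = n cancels)

slope <- unname(coef(lm(lag ~ y))[2])
c(I = I, slope = slope)               # the two values agree
```

The agreement is exact for row-standardized weights because centering the lag around its own mean rather than the overall mean changes the covariance by a term that sums to zero.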
Let’s consider another data set.
if(!"sids2" %in% list.files(here::here("data"))) {
download.file("http://myweb.fsu.edu/jelsner/temp/data/sids2.zip",
destfile = here::here("data", "sids2.zip"))
unzip(here::here("data", "sids2.zip"),
exdir = here::here("data"))
}
SIDS.sf <- sf::st_read(dsn = here::here("data", "sids2")) |>
sf::st_set_crs(4326)## Reading layer `sids2' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/sids2' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## CRS: NA
head(SIDS.sf)## Simple feature collection with 6 features and 18 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## Geodetic CRS: WGS 84
## AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 0.114 1.442 1825 1825 Ashe 37009 37009 5 1091 1
## 2 0.061 1.231 1827 1827 Alleghany 37005 37005 3 487 0
## 3 0.143 1.630 1828 1828 Surry 37171 37171 86 3188 5
## 4 0.070 2.968 1831 1831 Currituck 37053 37053 27 508 1
## 5 0.153 2.206 1832 1832 Northampton 37131 37131 66 1421 9
## 6 0.097 1.670 1833 1833 Hertford 37091 37091 46 1452 7
## NWBIR74 BIR79 SID79 NWBIR79 SIDR74 SIDR79 NWR74 NWR79
## 1 10 1364 0 19 0.916590 0.000000 9.165903 13.92962
## 2 10 542 3 12 0.000000 5.535055 20.533881 22.14022
## 3 208 3616 6 260 1.568381 1.659292 65.244668 71.90265
## 4 123 830 2 145 1.968504 2.409639 242.125984 174.69879
## 5 1066 1606 3 1197 6.333568 1.867995 750.175932 745.33001
## 6 954 1838 5 1237 4.820937 2.720348 657.024793 673.01415
## geometry
## 1 MULTIPOLYGON (((-81.47276 3...
## 2 MULTIPOLYGON (((-81.23989 3...
## 3 MULTIPOLYGON (((-80.45634 3...
## 4 MULTIPOLYGON (((-76.00897 3...
## 5 MULTIPOLYGON (((-77.21767 3...
## 6 MULTIPOLYGON (((-76.74506 3...
The column SIDR79 contains the death rate (per 1000 live births) from sudden infant death syndrome over the period 1979–84. Create a choropleth map of the SIDS rates.
tmap::tm_shape(SIDS.sf) +
tmap::tm_fill("SIDR79", title = "") +
tmap::tm_borders(col = "gray70") +
tmap::tm_layout(title = "SIDS Rates 1979-84 [per 1000]",
legend.outside = TRUE)
Create a neighborhood list (nb) and a listw object (wts) then graph the neighborhood network.
nbs <- spdep::poly2nb(SIDS.sf)
wts <- spdep::nb2listw(nbs)
plot(nbs,
sf::st_centroid(sf::st_geometry(SIDS.sf)))
Next compute Moran’s I on the SIDS rates over the period 1979–84.
m <- length(SIDS.sf$SIDR79)
s <- spdep::Szero(wts)
spdep::moran(SIDS.sf$SIDR79,
listw = wts,
n = m,
S0 = s)## $I
## [1] 0.1427504
##
## $K
## [1] 4.44434
I is .14 and K is 4.4. A normal distribution has a kurtosis of 3. Values less than about 2 or greater than about 4 indicate that inferences about autocorrelation based on the assumption of normality are suspect.
Weights are specified using the style = argument in the nb2listw() function. The default “W” is row standardized (each row of weights sums to one, so the sum of the weights over all links equals the number of polygons). “B” is binary (each neighbor gets a weight of one). “S” is a variance-stabilizing scheme.
Each style gives a somewhat different value for I.
x <- SIDS.sf$SIDR79
spdep::moran.test(x, spdep::nb2listw(nbs, style = "W"))$estimate[1]## Moran I statistic
## 0.1427504
spdep::moran.test(x, spdep::nb2listw(nbs, style = "B"))$estimate[1] # binary## Moran I statistic
## 0.1105207
spdep::moran.test(x, spdep::nb2listw(nbs, style = "S"))$estimate[1] # variance-stabilizing## Moran I statistic
## 0.1260686
When reporting a Moran’s I you need to state what type of weighting was used.
Let sids be a vector with elements containing the SIDS rate in each county. You create a spatial lag variable using the spdep::lag.listw() function. The first argument is the listw object and the second is the vector of rates.
sids <- SIDS.sf$SIDR79
Wsids <- spdep::lag.listw(wts,
sids)
For each value in the vector sids there is a corresponding value in the vector Wsids representing the neighborhood average SIDS rate.
Wsids[1]## [1] 2.65921
j <- wts$neighbours[[1]]
j## [1] 2 18 19
sum(SIDS.sf$SIDR79[j])/length(j)## [1] 2.65921
The spatial lag for county one is Wsids[1] = 2.659. The indexes of this county’s neighbors are in the vector wts$neighbours[[1]], which has length 3. Add the SIDS rates from those three counties and divide by the number of neighbors (length(j)).
A scatter plot of the neighborhood average SIDS rate versus the actual SIDS rate in each region.
data.frame(sids, Wsids) |>
ggplot(aes(x = sids, y = Wsids)) +
geom_point() +
geom_smooth(method = lm) +
scale_x_continuous(limits = c(0, 7)) +
scale_y_continuous(limits = c(0, 7)) +
xlab("SIDS") + ylab("Spatial Lag of SIDS") +
theme_minimal()## `geom_smooth()` using formula 'y ~ x'

The regression line slopes upward indicating positive spatial autocorrelation. The value of the slope is I. To check this type
lm(Wsids ~ sids)##
## Call:
## lm(formula = Wsids ~ sids)
##
## Coefficients:
## (Intercept) sids
## 1.7622 0.1428
Thursday September 29, 2022
“Be curious. Read widely. Try new things. I think a lot of what people call intelligence boils down to curiosity.” - Aaron Swartz
Today
- Other neighbor definitions
- Assessing the statistical significance of autocorrelation
- Bivariate spatial autocorrelation
- Local indicators of spatial autocorrelation
Other neighbor definitions
Last time you saw how to compute autocorrelation using areal aggregated data. The procedure involves a weights matrix, which you created using the default neighborhood definition and the weighting scheme with functions from the {spdep} package.
It was noted that the magnitude of autocorrelation depends on the weighting scheme used. Other neighborhood definitions are possible and they will also influence the magnitude of the autocorrelation.
Let’s consider the historical demographic data in Mississippi counties. Import the data as a simple feature data frame and assign the geometry a geographic CRS.
if(!"police" %in% list.files(here::here("data"))) {
download.file(url = "http://myweb.fsu.edu/jelsner/temp/data/police.zip",
destfile = here::here("data", "police.zip"))
unzip(here::here("data", "police.zip"),
exdir = here::here("data"))
}
( PE.sf <- sf::st_read(dsn = here::here("data", "police"),
layer = "police") |>
sf::st_set_crs(4326) )## Reading layer `police' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/police' using driver `ESRI Shapefile'
## Simple feature collection with 82 features and 21 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -91.64356 ymin: 30.19474 xmax: -88.09043 ymax: 35.00496
## CRS: NA
## Simple feature collection with 82 features and 21 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -91.64356 ymin: 30.19474 xmax: -88.09043 ymax: 35.00496
## Geodetic CRS: WGS 84
## First 10 features:
## AREA PERIMETER CNTY_ CNTY_ID NAME STATE_NAME STATE_FIPS CNTY_FIPS
## 1 0.105 1.401 2129 2129 Alcorn Mississippi 28 003
## 2 0.111 1.485 2130 2130 Tishomingo Mississippi 28 141
## 3 0.116 1.519 2131 2131 Tippah Mississippi 28 139
## 4 0.105 1.478 2132 2132 Benton Mississippi 28 009
## 5 0.127 1.774 2133 2133 De Soto Mississippi 28 033
## 6 0.181 1.911 2134 2134 Marshall Mississippi 28 093
## 7 0.119 2.146 2155 2155 Tunica Mississippi 28 143
## 8 0.103 1.571 2171 2171 Tate Mississippi 28 137
## 9 0.106 1.394 2176 2176 Prentiss Mississippi 28 117
## 10 0.109 1.493 2199 2199 Union Mississippi 28 145
## FIPS FIPSNO POLICE POP TAX TRANSFER INC CRIME UNEMP OWN COLLEGE WHITE
## 1 28003 28003 706 32500 122 12428 8206 43 7 70 23 89
## 2 28141 28141 247 19100 112 7278 6666 316 8 73 18 96
## 3 28139 28139 296 18800 93 8606 6865 5 7 71 18 84
## 4 28009 28009 116 8400 100 3494 6083 24 12 75 16 62
## 5 28033 28033 1063 56400 116 18555 8731 36 6 77 26 82
## 6 28093 28093 549 30900 87 10370 5825 316 11 71 22 47
## 7 28143 28143 291 9500 153 5354 6019 42 15 43 18 27
## 8 28137 28137 444 20500 137 13783 7837 20 8 67 29 61
## 9 28117 28117 455 24400 118 14650 6361 41 6 74 20 89
## 10 28145 28145 364 21400 117 8207 7530 46 5 73 22 86
## COMMUTE geometry
## 1 8 POLYGON ((-88.35416 34.7626...
## 2 8 POLYGON ((-88.32171 34.4693...
## 3 15 POLYGON ((-88.72614 34.6048...
## 4 41 POLYGON ((-89.23874 34.5935...
## 5 2 POLYGON ((-90.20186 34.7297...
## 6 12 POLYGON ((-89.66407 34.5659...
## 7 3 POLYGON ((-90.19978 34.5617...
## 8 11 POLYGON ((-89.71541 34.5659...
## 9 23 POLYGON ((-88.32171 34.4693...
## 10 20 POLYGON ((-89.24072 34.5017...
Variables in the simple feature data frame include police expenditures (POLICE), crime (CRIME), income (INC), unemployment (UNEMP) and other socio-economic characteristics across Mississippi at the county level. Police expenditures are in dollars per person in 1982. Personal income is in dollars per person in 1982. Crime is the number of serious crimes per 100,000 persons in 1981. Unemployment is the percent of people looking for work in 1980.
The geometries are polygons that define the county borders.
library(ggplot2)
ggplot(data = PE.sf) +
geom_sf()
To estimate autocorrelation for any variable in the data frame, you need to first assign the neighbors and weights for each region.
The default options in the spdep::poly2nb() and spdep::nb2listw() result in neighbors defined by ‘queen’ contiguity (polygon intersections can include a single point) and weights defined by row standardization (the sum of the weights equals the number of regions).
nbs <- spdep::poly2nb(PE.sf)
wts <- spdep::nb2listw(nbs)
Alternatively you can specify the number of neighbors and then assign neighbors based on proximity (closeness). Here you first extract the coordinates of the polygon centroids as a matrix.
coords <- PE.sf |>
sf::st_centroid() |>
sf::st_coordinates()## Warning in st_centroid.sf(PE.sf): st_centroid assumes attributes are constant
## over geometries of x
head(coords)## X Y
## 1 -88.56938 34.88746
## 2 -88.23073 34.74665
## 3 -88.89928 34.77600
## 4 -89.17942 34.82323
## 5 -89.98973 34.88099
## 6 -89.49699 34.76984
Then use the spdep::knearneigh() function on the coordinate matrix and specify the number of neighbors with the k = argument. Here you set it to six. That is, allow each county to have 6 closest neighbors.
Since the CRS is geographic you need to include the longlat = TRUE argument so distances are calculated using great circles.
knn <- spdep::knearneigh(coords,
k = 6,
longlat = TRUE)
names(knn)## [1] "nn" "np" "k" "dimension" "x"
head(knn$nn)## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 9 3 2 4 10 14
## [2,] 9 1 16 3 14 10
## [3,] 4 10 1 9 6 14
## [4,] 3 6 10 1 11 9
## [5,] 8 7 6 12 11 4
## [6,] 4 8 11 5 3 10
The output is a list of five elements with the first element a matrix with the row dimension the number of counties and the column dimension the number of neighbors.
Note that by using distance to define neighbors the matrix is not symmetric. For example, county 3 is a neighbor of county 2, but county 2 is not a neighbor of county 3.
Certain spatial models require the neighbor matrix to be symmetric. That is if region X is a neighbor of region Y then region Y must be a neighbor of region X.
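The asymmetry, and the idea behind forcing symmetry, is easy to see on a tiny made-up adjacency matrix (this illustrates the concept behind sym = TRUE, not the {spdep} internals):

```r
# A directed (knn-style) neighbor matrix: a 1 in row i, column j means
# region j is a neighbor of region i
A <- matrix(0, 3, 3)
A[1, 2] <- 1    # region 2 is a neighbor of region 1
A[2, 3] <- 1    # region 3 is a neighbor of region 2, but not vice versa

isSymmetric(A)            # FALSE: the neighbor relation is one-way

A_sym <- pmax(A, t(A))    # add the missing reverse links
isSymmetric(A_sym)        # TRUE
```

Adding the reverse links is why the symmetric version of the nearest-neighbor graph has more links per region than the k you asked for.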
You turn this matrix into a neighbor object (class nb) with the spdep::knn2nb() function.
nbs2 <- spdep::knn2nb(knn)
summary(nbs2)## Neighbour list object:
## Number of regions: 82
## Number of nonzero links: 492
## Percentage nonzero weights: 7.317073
## Average number of links: 6
## Non-symmetric neighbours list
## Link number distribution:
##
## 6
## 82
## 82 least connected regions:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 with 6 links
## 82 most connected regions:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 with 6 links
If you include the argument sym = TRUE in the knn2nb() function then it forces the neighbor matrix to be symmetric.
nbs3 <- spdep::knn2nb(knn,
sym = TRUE)
summary(nbs3)## Neighbour list object:
## Number of regions: 82
## Number of nonzero links: 568
## Percentage nonzero weights: 8.447353
## Average number of links: 6.926829
## Link number distribution:
##
## 6 7 8 9 10
## 37 25 13 3 4
## 37 least connected regions:
## 1 2 5 6 7 8 13 16 18 21 22 28 31 33 34 40 42 44 45 46 48 49 50 54 57 59 62 63 65 73 74 76 77 78 80 81 82 with 6 links
## 4 most connected regions:
## 10 19 66 70 with 10 links
The result shows that six is now the minimum number of nearest neighbors, with some counties having as many as 10 neighbors, to guarantee symmetry.
Compare the default adjacency neighborhoods with the nearest-neighbor neighborhoods.
plot(sf::st_geometry(PE.sf), border = "grey")
plot(nbs, coords, add = TRUE)
plot(sf::st_geometry(PE.sf), border = "grey")
plot(nbs2, coords, add = TRUE)
Toggle between the plots.
A difference between the two neighborhoods is the number of links on counties along the borders. The nearest-neighbor defined neighborhoods have more links. Note: when neighbors are defined by proximity counties can share a border but they still may not be neighbors.
Your choice of neighbors is based on domain specific knowledge. If the process you are interested in can be described by a dispersal mechanism then proximity definition might be the right choice for defining neighbors. If the process can be described by a border diffusion mechanism then contiguity might be the right choice.
Create weight matrices for these alternative neighborhoods using the same spdep::nb2listw() function.
wts2 <- spdep::nb2listw(nbs2)
wts3 <- spdep::nb2listw(nbs3)
You compute Moran’s I for the percentage of white people variable (WHITE) with the moran() function separately for the three different weight matrices.
spdep::moran(PE.sf$WHITE,
listw = wts,
n = length(nbs),
S0 = spdep::Szero(wts))## $I
## [1] 0.5634778
##
## $K
## [1] 2.300738
spdep::moran(PE.sf$WHITE,
listw = wts2,
n = length(nbs2),
S0 = spdep::Szero(wts2))## $I
## [1] 0.5506132
##
## $K
## [1] 2.300738
spdep::moran(PE.sf$WHITE,
listw = wts3,
n = length(nbs3),
S0 = spdep::Szero(wts3))## $I
## [1] 0.5592557
##
## $K
## [1] 2.300738
Values of Moran’s I are constrained between -1 and +1. In this case the neighborhood definition has little or no impact on inferences made about spatial autocorrelation. The kurtosis is between 2 and 4 consistent with a set of values from a normal distribution.
In a similar way you compute the Geary’s C statistic.
spdep::geary(PE.sf$WHITE,
listw = wts,
n = length(nbs),
S0 = spdep::Szero(wts),
n1 = length(nbs) - 1)## $C
## [1] 0.4123818
##
## $K
## [1] 2.300738
Values of Geary’s C range between 0 and 2 with values less than one indicating positive autocorrelation.
If the values of Moran’s I and Geary’s C result in different interpretations about the amount of clustering then it is a good idea to examine local variations in autocorrelation.
Assessing the statistical significance of autocorrelation
Attribute values randomly placed across a spatial domain will result in some autocorrelation. Statistical tests provide a way to guard against being fooled by this randomness. For example, claiming a ‘hot spot’ when none exists. In statistical parlance, is the value of Moran’s I significant with respect to the null hypothesis of no autocorrelation?
One way to answer this question is to draw an uncertainty band on the regression line in a Moran scatter plot. If a horizontal line can be placed entirely within the band then the slope (Moran’s I) is not significant against the null hypothesis of no autocorrelation.
More formally the question is answered by comparing the standard deviate (\(z\) value) of the I statistic to the appropriate value from a standard normal distribution. This is done using the spdep::moran.test() function, where the \(z\) value is the difference between I and the expected value of I divided by the square root of the variance of I.
The function takes a variable name or numeric vector and a spatial weights list object in that order. The argument randomisation = FALSE means the variance of I is computed under the assumption of normally distributed unemployment (UNEMP) rates.
( mt <- spdep::moran.test(PE.sf$UNEMP,
listw = wts,
randomisation = FALSE) )##
## Moran I test under normality
##
## data: PE.sf$UNEMP
## weights: wts
##
## Moran I statistic standard deviate = 3.4102, p-value = 0.0003246
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.217503452 -0.012345679 0.004542775
Moran’s I is .218 with a variance of .0045. The \(z\) value for I is 3.41 giving a \(p\)-value of .0003 under the null hypothesis of no autocorrelation. Thus you reject the null hypothesis and conclude there is weak but significant autocorrelation in unemployment rates across Mississippi at the county level.
Outputs from the spdep::moran.test() function are in the form of a list.
str(mt)## List of 6
## $ statistic : Named num 3.41
## ..- attr(*, "names")= chr "Moran I statistic standard deviate"
## $ p.value : num 0.000325
## $ estimate : Named num [1:3] 0.2175 -0.01235 0.00454
## ..- attr(*, "names")= chr [1:3] "Moran I statistic" "Expectation" "Variance"
## $ alternative: chr "greater"
## $ method : chr "Moran I test under normality"
## $ data.name : chr "PE.sf$UNEMP \nweights: wts \n"
## - attr(*, "class")= chr "htest"
The list element called estimate is a vector of length three containing Moran’s I, the expected value of Moran’s I under the assumption of no autocorrelation, and the variance of Moran’s I.
The \(z\) value is the difference between I and its expected value divided by the square root of the variance.
( mt$estimate[1] - mt$estimate[2] ) / sqrt(mt$estimate[3])## Moran I statistic
## 3.410219
The \(p\)-value is the area under a standard normal distribution curve to the right (lower.tail = FALSE) of 3.4102 (mt$statistic).
pnorm(mt$statistic,
lower.tail = FALSE)## Moran I statistic standard deviate
## 0.000324554
curve(dnorm(x), from = -4, to = 4, lwd = 2)
abline(v = mt$statistic, col = 'red')
So about .03% of the area lies to the right of the red line.
Recall the \(p\)-value summarizes the evidence in support of the null hypothesis. The smaller the \(p\)-value, the less evidence there is in support of the null hypothesis.
In this case it is the probability of getting a Moran’s I at least this large if the county unemployment rates had been arranged at random across the state. The small \(p\)-value tells you that the spatial arrangement of the data is unusual with respect to the null hypothesis.
The interpretation of the \(p\)-value is stated as evidence AGAINST the null hypothesis. This is because interest lies in the null hypothesis being untenable. A \(p\)-value less than .01 is said to provide convincing evidence against the null, a \(p\)-value between .01 and .05 is said to provide moderate evidence against the null, and a \(p\)-value between .05 and .15 is said to be suggestive, but inconclusive in providing evidence against the null. A \(p\)-value greater than .15 is said to provide no evidence against the null.
Note you do not interpret “no evidence” as “no effect (no autocorrelation)”.
Under the assumption of normal distributed and uncorrelated data, the expected value for Moran’s I is -1/(n-1) where n is the number of regions.
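You can confirm this against the Expectation printed in the test output above: Mississippi has 82 counties.

```r
# Expected Moran's I under the null hypothesis of no autocorrelation
n <- 82              # number of Mississippi counties
-1 / (n - 1)         # matches the Expectation of -0.012345679 above
```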
A check on the distribution of unemployment rates indicates that normality is somewhat suspect. A good way to check the normality assumption is to use the sm.density() function from the {sm} package.
if(!require(sm)) install.packages("sm", repos = "http://cran.us.r-project.org")## Loading required package: sm
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
sm::sm.density(PE.sf$UNEMP,
model = "Normal",
xlab = "Unemployment Rates")
The unemployment rates are less “peaked” (lower kurtosis) than a normal distribution. In this case it is better to use the default randomisation = TRUE argument.
Further, the assumptions underlying Moran’s test are sensitive to the form of the graph of neighbor relationships and other factors so results should be checked against a test that involves permutations.
A random sampling approach to inference is made with the spdep::moran.mc() function. MC stands for Monte Carlo, a reference to the district of Monaco famous for its gambling casinos.
The name of the data vector and the weights list object (listw) are required as is the number of permutations (nsim). Each permutation is a random rearrangement of the unemployment rates across the counties. This removes the spatial autocorrelation but keeps the non-spatial distribution of the unemployment rates. The neighbor topology and weights remain the same.
For each permutation (random shuffle of the data values), I is computed and saved. The \(p\)-value is the number of permuted I values equal to or exceeding the observed I, plus one (the observed arrangement counts as one of the possible permutations), divided by the number of permutations plus one. For example, if 4 of 99 permuted I values equal or exceed the observed value, the \(p\)-value is (4 + 1)/(99 + 1) = .05.
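The mechanics of the permutation test can be sketched in a few lines of base R. The values and chain weights below are made up purely for illustration; spdep::moran.mc() is what you use in practice.

```r
# Monte Carlo test for Moran's I: shuffle the values, keep the topology
set.seed(1)
y <- c(2, 4, 3, 8, 6, 1, 7, 5)
n <- length(y)
W <- matrix(0, n, n)
for (i in 1:(n - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }

moran_I <- function(y, W) {
  d <- y - mean(y)
  (length(y) / sum(W)) * sum(W * outer(d, d)) / sum(d^2)
}

I_obs <- moran_I(y, W)
I_sim <- replicate(999, moran_I(sample(y), W))   # 999 random shuffles

# permutation p-value: the observed value counts as one of the draws
( p_val <- (sum(I_sim >= I_obs) + 1) / (999 + 1) )
```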
For example, if you want inference on I using 9999 permutations type
set.seed(40453)
( mP <- spdep::moran.mc(PE.sf$UNEMP,
listw = wts,
nsim = 9999) )##
## Monte-Carlo simulation of Moran I
##
## data: PE.sf$UNEMP
## weights: wts
## number of simulations + 1: 10000
##
## statistic = 0.2175, observed rank = 9991, p-value = 9e-04
## alternative hypothesis: greater
Nine of the permutations yield a Moran’s I greater than .218, hence the \(p\)-value as evidence in support of the null hypothesis (the true value for Moran’s I is zero) is .0009.
Note: you initiate the random number generator with a seed value (any will do) so that the set of random permutations of the values across the domain will be the same each time you run this code chunk. This is important for reproducibility. The default random number generator seed value is determined from the current time (internal clock) and so no random permutations will be identical. To control the seed use the set.seed() function.
The values of I computed for each permutation are saved in the vector mP$res.
head(mP$res)## [1] -0.03052409 0.05019765 0.01346706 0.03189984 -0.07625158 -0.07398726
tail(mP$res)## [1] 0.01973190 -0.01000012 -0.04472215 -0.12488347 -0.01269481 0.21750345
The last value in the vector is I computed using the data in the correct counties. The \(p\)-value as evidence in support of the null hypothesis that I is zero is given as
sum(mP$res > mP$res[10000])/9999## [1] 0.00090009
A density graph displays the distribution of permuted I’s.
df <- data.frame(mp = mP$res[-10000])
ggplot(data = df,
mapping = aes(mp)) +
geom_density() +
geom_rug() +
geom_vline(xintercept = mP$res[10000],
color = "red", size = 2) +
theme_minimal()
The density curve is centered just to the left of zero consistent with the theoretical expectation (mean).
What to do with the knowledge that the unemployment rates have significant autocorrelation? By itself, not much, but it can provide notice that something might be going on in certain regions (hot spot analysis).
The knowledge is useful after other factors are considered. In the language of statistics, knowledge of significant autocorrelation in the model residuals can help you build a better model.
Bivariate spatial autocorrelation
The idea of spatial autocorrelation can be extended to two variables. It is motivated by the fact that aspatial bi-variate association measures, like Pearson’s correlation, do not recognize the spatial arrangement of the regions.
Consider the correlation between police expenditure (POLICE) and the amount of crime (CRIME) in the police expenditure data set.
police <- PE.sf$POLICE
crime <- PE.sf$CRIME
cor.test(police, crime, method = "pearson")##
## Pearson's product-moment correlation
##
## data: police and crime
## t = 6.2916, df = 80, p-value = 1.569e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4094516 0.7043990
## sample estimates:
## cor
## 0.5753377
You note a significant (direct) correlation (\(p\)-value < .01) exists between these two variables.
But you also note some significant spatial autocorrelation in each of the variables separately.
spdep::moran.test(police,
listw = wts)##
## Moran I test under randomisation
##
## data: police
## weights: wts
##
## Moran I statistic standard deviate = 1.7899, p-value = 0.03674
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.087185424 -0.012345679 0.003092257
spdep::moran.test(crime,
listw = wts)##
## Moran I test under randomisation
##
## data: crime
## weights: wts
##
## Moran I statistic standard deviate = 2.2072, p-value = 0.01365
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.103588680 -0.012345679 0.002758842
The Lee statistic integrates the Pearson correlation as an aspatial bi-variate association metric with Moran’s I as a uni-variate spatial autocorrelation metric. The formula is \[ L(x,y) = \frac{n}{\sum_{i=1}^{n}\left(\sum_{j=1}^{n}w_{ij}\right)^2} \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{n}w_{ij}(x_j-\bar{x})\right) \left(\sum_{j=1}^{n}w_{ij}(y_j-\bar{y})\right)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \] where \(w_{ij}\) are the weights and \(n\) is the number of areas.
The formula is implemented in the spdep::lee() function where the first two arguments are the variables of interest and you need to include the weights matrix and the number of regions. The output from this function is a list of two with the first being the value of Lee’s statistic (L).
spdep::lee(crime, police,
listw = wts,
n = length(nbs))$L## [1] 0.1306991
Values of L range between -1 and +1 with the value here of .13 indicating relatively weak bi-variate spatial autocorrelation between crime and police expenditures. Statistically you infer that crime in a county has some influence on police expenditure in that county and in the neighboring counties, but not much.
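The formula is short enough to implement directly, which helps make its structure concrete: each variable is spatially smoothed through the weights, and L is a scaled cross-product of the smoothed deviations. The function and data below are a sketch with assumed toy vectors and a row-standardized chain weights matrix; spdep::lee() is the function to use in practice.

```r
# Lee's L computed directly from the formula (toy illustration)
lee_L <- function(x, y, W) {
  n  <- length(x)
  lx <- as.vector(W %*% (x - mean(x)))   # spatially smoothed x deviations
  ly <- as.vector(W %*% (y - mean(y)))   # spatially smoothed y deviations
  (n / sum(rowSums(W)^2)) * sum(lx * ly) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
}

n <- 5
W <- matrix(0, n, n)
for (i in 1:(n - 1)) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
Ws <- W / rowSums(W)                     # row standardize

lee_L(1:5, 1:5, Ws)   # identical, smoothly varying vectors give L = 0.4 here
```

Note that even a variable paired with itself gives L < 1 unless the spatial smoothing leaves it unchanged, which is why L mixes correlation with autocorrelation.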
The crime and police variables are not adequately described by a normal distribution.
par(mfrow = c(2, 1))
sm::sm.density(crime, model = "normal")
sm::sm.density(police, model = "normal")
Thus you perform a non-parametric test on the bi-variate spatial autocorrelation with the spdep::lee.mc() function. The crime and police expenditure values are randomly permuted and values of L are computed for each permutation.
spdep::lee.mc(crime, police,
listw = wts,
nsim = 999)##
## Monte-Carlo simulation of Lee's L
##
## data: crime , police
## weights: wts
## number of simulations + 1: 1000
##
## statistic = 0.1307, observed rank = 760, p-value = 0.24
## alternative hypothesis: greater
Based on a \(p\)-value that exceeds .05 you conclude that there is no significant bi-variate spatial autocorrelation between crime and police expenditure in these data.
Local indicators of spatial autocorrelation
The Moran’s I statistic was first used in the 1950s. Localization of the statistic was presented by Luc Anselin in 1995 (Anselin, L. 1995. Local indicators of spatial association, Geographical Analysis, 27, 93–115).
Earlier you saw the raster::MoranLocal() function from the {raster} package returns a raster of local Moran’s I values.
Local I is a deconstruction of global I where geographic proximity is used in two ways: (1) to define and weight neighbors, and (2) to determine the spatial scale over which I is computed.
Using queen’s contiguity you determine the neighborhood topology and the weights for the police expenditure data from Mississippi. Here you print them in full matrix form with the spdep::listw2mat() function.
round(spdep::listw2mat(wts)[1:5, 1:10], 2)## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 1 0.00 0.33 0.33 0.00 0 0.00 0.00 0.00 0.33 0.00
## 2 0.33 0.00 0.00 0.00 0 0.00 0.00 0.00 0.33 0.00
## 3 0.25 0.00 0.00 0.25 0 0.00 0.00 0.00 0.25 0.25
## 4 0.00 0.00 0.33 0.00 0 0.33 0.00 0.00 0.00 0.33
## 5 0.00 0.00 0.00 0.00 0 0.33 0.33 0.33 0.00 0.00
The matrix shows that the first county has three neighbors (2, 3, and 9), each getting a weight of 1/3. The third county has four neighbors (1, 4, 9, and 10), each getting a weight of 1/4.
Compute local Moran’s I on the percentage of white people using the spdep::localmoran() function. Two arguments are needed (1) the attribute variable for which you want to compute local correlation and (2) the weights matrix as a list object.
Ii_stats <- spdep::localmoran(PE.sf$WHITE,
listw = wts)
str(Ii_stats)## 'localmoran' num [1:82, 1:5] 2.28138 2.97475 1.31244 0.00231 -1.03216 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:82] "1" "2" "3" "4" ...
## ..$ : chr [1:5] "Ii" "E.Ii" "Var.Ii" "Z.Ii" ...
## - attr(*, "call")= language spdep::localmoran(x = PE.sf$WHITE, listw = wts)
## - attr(*, "quadr")='data.frame': 82 obs. of 3 variables:
## ..$ mean : Factor w/ 4 levels "Low-Low","High-Low",..: 4 4 4 4 2 3 1 1 4 4 ...
## ..$ median: Factor w/ 4 levels "Low-Low","High-Low",..: 4 4 4 3 2 3 1 1 4 4 ...
## ..$ pysal : Factor w/ 4 levels "Low-Low","High-Low",..: 4 4 4 4 2 3 1 1 4 4 ...
The local I is stored in the first column of a matrix where the rows are the counties. The other columns are the expected value for I, the variance of I, the \(z\) value and the \(p\)-value. For example, the local I statistics for the first six counties are printed by typing
head(Ii_stats)
## Ii E.Ii Var.Ii Z.Ii Pr(z != E(Ii))
## 1 2.281375143 -2.748824e-02 7.124247e-01 2.735450 0.006229509
## 2 2.974750377 -4.354053e-02 1.109833e+00 2.865051 0.004169423
## 3 1.312440365 -1.827251e-02 3.539514e-01 2.236725 0.025304339
## 4 0.002313108 -2.007906e-07 5.351069e-06 1.000031 0.317295645
## 5 -1.032155817 -1.511126e-02 3.966295e-01 -1.614907 0.106330864
## 6 -0.493034653 -8.356103e-03 1.291002e-01 -1.348933 0.177358557
Because these local values must average to the global value (when using row-standardized weights), they can take on values outside the range between -1 and 1. A summary() method on the first column of the Ii_stats object gives statistics from the non-spatial distribution of I’s.
summary(Ii_stats[, 1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.03216 0.01733 0.26984 0.56348 1.05945 2.97475
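You can verify the averaging property with a hand-rolled version of the statistic. This sketch uses toy data and a chain of neighbors (an assumption for illustration, not the Mississippi counties); it follows the standard definitions, so {spdep}'s internal variants may differ in minor details:

```r
set.seed(1)
x <- rnorm(6)
B <- matrix(0, 6, 6)
B[abs(row(B) - col(B)) == 1] <- 1          # chain topology: i neighbors i - 1 and i + 1
W <- B / rowSums(B)                        # row-standardized weights
z <- x - mean(x)
Ii <- length(x) * z * as.vector(W %*% z) / sum(z^2)  # local Moran's I for each area
I  <- sum(z * as.vector(W %*% z)) / sum(z^2)         # global Moran's I (S0 = n here)
all.equal(mean(Ii), I)                     # the local values average to the global value
```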
You map the values by first attaching the matrix columns of interest to the simple feature data frame. Here you attach Ii, Var, and Pi.
PE.sf$Ii <- Ii_stats[, 1]
PE.sf$Vi <- Ii_stats[, 3]
PE.sf$Pi <- Ii_stats[, 5]
Then map the local I values using {ggplot2} syntax.
( g1 <- ggplot(data = PE.sf) +
geom_sf(aes(fill = Ii)) +
scale_fill_gradient2(low = "green",
high = "blue") )
You also map out the variances.
ggplot(data = PE.sf) +
geom_sf(aes(fill = Vi)) +
scale_fill_gradient()
Variances are larger for counties near the boundaries as the sample sizes are smaller.
Compare the map of local autocorrelation with a map of percent white.
( g2 <- ggplot(data = PE.sf) +
geom_sf(aes(fill = WHITE)) +
scale_fill_gradient(low = "black",
high = "white") )
Plot them together.
library(patchwork)
g1 + g2
Areas where percent white is high over the northeast are areas with the largest spatial correlation. Other areas of high spatial correlation include the Mississippi Valley and in the south. Note the county with the most negative spatial correlation is the county in the northwest with a fairly high percentage of whites neighbored by counties with much lower percentages of whites.
Local values of Lee’s bi-variate spatial autocorrelation are available from the spdep::lee() function.
lee_stat <- spdep::lee(crime, police,
listw = wts,
n = length(nbs))
PE.sf$localL <- lee_stat$localL
tmap::tm_shape(PE.sf) +
tmap::tm_fill("localL",
title = "") +
tmap::tm_borders(col = "gray70") +
tmap::tm_layout(title = "Local bi-variate spatial autocorrelation",
legend.outside = TRUE)
## Variable(s) "localL" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Areas in dark green indicate where the correlation between crime and policing is most influenced by neighboring crime and policing.
Population and tornado reports
Is the frequency of tornado reports correlated with the number of people in a region? Might this correlation extend to the number of people in neighboring regions?
To answer these questions you quantify the non-spatial correlation and the bi-variate spatial autocorrelation between tornado occurrences and population. To keep this manageable you focus on one state (Iowa).
Start by getting the U.S. Census data with functions from the {tidycensus} package. Downloading census data with these functions requires that you register with the Census Bureau.
You can get an API key from http://api.census.gov/data/key_signup.html. Then use the tidycensus::census_api_key() function and put your key in quotes.
tidycensus::census_api_key("YOUR API KEY GOES HERE")
The get_decennial() function grants access to the 1990, 2000, and 2010 decennial US Census data and the get_acs() function grants access to the 5-year American Community Survey data. For example, here is how you get county-level population for Iowa.
Counties.sf <- tidycensus::get_acs(geography = "county",
variables = "B02001_001E",
state = "IA",
geometry = TRUE)## Getting data from the 2016-2020 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
The code returns a simple feature data frame with county borders as multi-polygons. The variable B02001_001E is the 2016-2020 population estimate in each county within the state.
Next get the tornado data and count the number of tracks by county. A single track can intersect more than one county.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-aspath"),
layer = "1950-2020-torn-aspath") |>
sf::st_transform(crs = sf::st_crs(Counties.sf)) |>
dplyr::filter(yr >= 2015)
## Reading layer `1950-2020-torn-aspath' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-aspath'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: LINESTRING
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
( TorCounts.df <- Torn.sf |>
sf::st_intersection(Counties.sf) |>
sf::st_drop_geometry() |>
dplyr::group_by(GEOID) |>
dplyr::summarize(nT = dplyr::n()) )
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## # A tibble: 89 × 2
## GEOID nT
## <chr> <int>
## 1 19001 8
## 2 19003 4
## 3 19007 6
## 4 19011 7
## 5 19013 4
## 6 19015 7
## 7 19017 1
## 8 19019 4
## 9 19021 1
## 10 19023 1
## # … with 79 more rows
Next join the counts to the simple feature data frame.
Counties.sf <- Counties.sf |>
dplyr::left_join(TorCounts.df,
by = "GEOID") |>
dplyr::mutate(nT = tidyr::replace_na(nT, 0)) |>
dplyr::mutate(Area = sf::st_area(Counties.sf),
rate = nT/Area/(2020 - 2015 + 1) * 10^10,
lpop = log10(estimate))
Note that some counties have no tornadoes and dplyr::left_join() returns a value of NA for those. You use dplyr::mutate() with tidyr::replace_na() to turn those counts into a value of 0.
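The scaling in the rate formula is worth unpacking: sf::st_area() returns areas in square meters, 10^10 m^2 is 10,000 km^2, and 2015-2020 spans six years. A quick check with made-up numbers (a hypothetical county of 1,500 km^2 with 6 tornado segments):

```r
area_m2 <- 1.5e9                      # 1,500 km^2 expressed in square meters
nT      <- 6                          # tornado segments in 2015-2020 (made up)
rate    <- nT / area_m2 / (2020 - 2015 + 1) * 10^10
rate                                  # annual tornadoes per 10,000 square km
```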
Make a two-panel map displaying the log of the population and the tornado rates.
map1 <- tmap::tm_shape(Counties.sf) +
tmap::tm_borders(col = "gray70") +
tmap::tm_fill(col = "lpop",
title = "Log Population",
palette = "Blues") +
tmap::tm_layout(legend.outside = TRUE)
map2 <- tmap::tm_shape(Counties.sf) +
tmap::tm_borders(col = "gray70") +
tmap::tm_fill(col = "rate",
title = "Annual Rate\n[/10,000 sq. km]",
palette = "Greens") +
tmap::tm_layout(legend.outside = TRUE)
tmap::tmap_arrange(map1, map2)
There appears to be some relationship. The non-spatial correlation between the two variables is obtained with the cor.test() function.
lpop <- Counties.sf$lpop
rate <- as.numeric(Counties.sf$rate)
cor.test(lpop, rate)
##
## Pearson's product-moment correlation
##
## data: lpop and rate
## t = 3.8566, df = 97, p-value = 0.0002069
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1801804 0.5242953
## sample estimates:
## cor
## 0.3646227
The bi-variate spatial autocorrelation is assessed using the Lee statistic. A formal non-parametric test under the null hypothesis of no bi-variate spatial autocorrelation is done using a Monte Carlo simulation.
nbs <- spdep::poly2nb(Counties.sf)
wts <- spdep::nb2listw(nbs)
lee_stat <- spdep::lee(lpop, rate,
listw = wts,
n = length(nbs))
lee_stat$L
## [1] 0.2056744
spdep::lee.mc(lpop, rate, listw = wts, nsim = 9999)
##
## Monte-Carlo simulation of Lee's L
##
## data: lpop , rate
## weights: wts
## number of simulations + 1: 10000
##
## statistic = 0.20567, observed rank = 10000, p-value = 1e-04
## alternative hypothesis: greater
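The Monte Carlo logic behind spdep::lee.mc() can be seen in miniature with a plain permutation test: shuffle one variable to break any association, recompute the statistic many times, and rank the observed value against the permutation distribution. A sketch using Pearson correlation as the statistic (an assumption for illustration; lee.mc() permutes under Lee's L):

```r
set.seed(7)
x <- rnorm(30)
y <- 0.6 * x + rnorm(30)                    # toy data with a built-in association
obs  <- cor(x, y)                           # observed statistic
sims <- replicate(999, cor(x, sample(y)))   # statistic under random permutation
(sum(sims >= obs) + 1) / (999 + 1)          # one-sided pseudo p-value
```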
Finally you map out the local variation in the bi-variate spatial autocorrelation.
Counties.sf$localL <- lee_stat$localL
tmap::tm_shape(Counties.sf) +
tmap::tm_fill("localL",
title = "Local Bivariate\nSpatial Autocorrelation") +
tmap::tm_borders(col = "gray70") +
tmap::tm_layout(legend.outside = TRUE)
## Variable(s) "localL" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

What might cause this? Compare with Kansas.
Also, compare local Lee with local Moran.
Ii_stats <- spdep::localmoran(rate,
listw = wts)
Counties.sf$localI = Ii_stats[, 1]
tmap::tm_shape(Counties.sf) +
tmap::tm_borders(col = "gray70") +
tmap::tm_fill(col = "localI",
title = "Local Autocorrelation",
palette = "Purples") +
tmap::tm_layout(legend.outside = TRUE)
Tuesday October 4, 2022
“The most important single aspect of software development is to be clear about what you are trying to build.” – Bjarne Stroustrup
Today
- Constraining group membership based on spatial autocorrelation
- Estimating spatial autocorrelation in model residuals
- Choosing a spatial regression model
Constraining group membership based on spatial autocorrelation
As a spatial data analyst you likely will face the situation in which there are many variables and you need to group them in a way that minimizes inter-group variation but maximizes between-group variation. If you know the number of groups a priori then a common grouping (or clustering) method is called K-means.
If your data is spatial you will want the additional constraint that the resulting groups be geographically linked. In fact there are many situations that require separating geographies into discrete but contiguous regions (regionalization) such as designing communities, planning areas, amenity zones, logistical units, or even for the purpose of setting up experiments with real world geographic constraints.
There are many situations where the optimal grouping using traditional cluster metrics is sub-optimal in practice because of these geographic constraints.
Unconstrained grouping on data with spatial characteristics may result in contiguous regions because of autocorrelation, but if you want to ensure that all groups are spatially-contiguous you need a method specifically designed for the task. The ‘skater’ algorithm available in the {spdep} package is well-implemented and well-documented.
The ‘skater’ algorithm (spatial ’k’luster analysis by tree edge removal) builds a connectivity graph to represent spatial relationships between neighboring areas, where each area is represented by a node and edges represent connections between areas. Edge costs are calculated by evaluating the dissimilarity in attribute space between neighboring areas. The connectivity graph is reduced by pruning edges with higher dissimilarity.
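The edge cost between two neighboring areas is just their dissimilarity in attribute space. With Euclidean distance (the spdep::nbcosts() default) and two hypothetical tracts described by scaled HOVAL, INC, and CRIME values (made-up numbers):

```r
a <- c(HOVAL =  0.5, INC = -1.0, CRIME = 0.2)   # tract 1 (hypothetical scaled values)
b <- c(HOVAL = -0.3, INC =  0.4, CRIME = 1.1)   # tract 2 (hypothetical scaled values)
sqrt(sum((a - b)^2))   # Euclidean edge cost between the two neighboring tracts
```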
Consider again the crime data at the tract level in the city of Columbus, Ohio. The tract polygons are projected with arbitrary spatial coordinates.
( CC.sf <- sf::st_read(dsn = here::here("data", "columbus"),
layer = "columbus") )
## Reading layer `columbus' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/columbus'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## First 10 features:
## AREA PERIMETER COLUMBUS_ COLUMBUS_I POLYID NEIG HOVAL INC CRIME
## 1 0.309441 2.440629 2 5 1 5 80.467 19.531 15.725980
## 2 0.259329 2.236939 3 1 2 1 44.567 21.232 18.801754
## 3 0.192468 2.187547 4 6 3 6 26.350 15.956 30.626781
## 4 0.083841 1.427635 5 2 4 2 33.200 4.477 32.387760
## 5 0.488888 2.997133 6 7 5 7 23.225 11.252 50.731510
## 6 0.283079 2.335634 7 8 6 8 28.750 16.029 26.066658
## 7 0.257084 2.554577 8 4 7 4 75.000 8.438 0.178269
## 8 0.204954 2.139524 9 3 8 3 37.125 11.337 38.425858
## 9 0.500755 3.169707 10 18 9 18 52.600 17.586 30.515917
## 10 0.246689 2.087235 11 10 10 10 96.400 13.598 34.000835
## OPEN PLUMB DISCBD X Y NSA NSB EW CP THOUS NEIGNO
## 1 2.850747 0.217155 5.03 38.80 44.07 1 1 1 0 1000 1005
## 2 5.296720 0.320581 4.27 35.62 42.38 1 1 0 0 1000 1001
## 3 4.534649 0.374404 3.89 39.82 41.18 1 1 1 0 1000 1006
## 4 0.394427 1.186944 3.70 36.50 40.52 1 1 0 0 1000 1002
## 5 0.405664 0.624596 2.83 40.01 38.00 1 1 1 0 1000 1007
## 6 0.563075 0.254130 3.78 43.75 39.28 1 1 1 0 1000 1008
## 7 0.000000 2.402402 2.74 33.36 38.41 1 1 0 0 1000 1004
## 8 3.483478 2.739726 2.89 36.71 38.71 1 1 0 0 1000 1003
## 9 0.527488 0.890736 3.17 43.44 35.92 1 1 1 0 1000 1018
## 10 1.548348 0.557724 4.33 47.61 36.42 1 1 1 0 1000 1010
## geometry
## 1 POLYGON ((8.624129 14.23698...
## 2 POLYGON ((8.25279 14.23694,...
## 3 POLYGON ((8.653305 14.00809...
## 4 POLYGON ((8.459499 13.82035...
## 5 POLYGON ((8.685274 13.63952...
## 6 POLYGON ((9.401384 13.5504,...
## 7 POLYGON ((8.037741 13.60752...
## 8 POLYGON ((8.247527 13.58651...
## 9 POLYGON ((9.333297 13.27242...
## 10 POLYGON ((10.08251 13.03377...
First, create choropleth maps of housing value, income, and crime.
tmap::tm_shape(CC.sf) +
tmap::tm_fill(col = c("HOVAL", "INC", "CRIME"))
## Warning: Currect projection of shape CC.sf unknown. Long-lat (WGS84) is assumed.

The maps show distinct regional patterns. Housing values and income are clustered toward the southeast and crime is clustered in the center. But although housing values are also high in the north you don’t necessarily want to group that tract with those in the southeast because they are geographically distinct.
To group these patterns under the constraint of spatial contiguity you first scale the attribute values and center them using the scale() function. Scaling and centering variables should be done before applying any clustering approach.
( CCs.df <- CC.sf |>
dplyr::mutate(HOVAL = scale(HOVAL),
INC = scale(INC),
CRIME = scale(CRIME)) |>
dplyr::select(HOVAL, INC, CRIME) |>
sf::st_drop_geometry() )
## HOVAL INC CRIME
## 1 2.27610855 0.90403637 -1.15961852
## 2 0.33200225 1.20228067 -0.97579369
## 3 -0.65450986 0.27721488 -0.26906635
## 4 -0.28355918 -1.73545197 -0.16382075
## 5 -0.82373916 -0.54755948 0.93250061
## 6 -0.52454175 0.29001413 -0.54160387
## 7 1.98005188 -1.04095129 -2.08883352
## 8 -0.07100723 -0.53265603 0.19704853
## 9 0.76701615 0.56301041 -0.27569218
## 10 3.13893423 -0.13622431 -0.06741470
## 11 -1.01462975 -1.21120127 1.62242856
## 12 -1.00379913 -0.75866244 1.28954855
## 13 0.17674452 -0.84615445 0.69251980
## 14 0.24172862 -0.77356589 1.31109176
## 15 -1.10669054 -0.78934601 0.80424271
## 16 -1.06336790 -1.18349839 1.17796908
## 17 0.17945213 -0.80249612 0.10398880
## 18 1.16775124 -0.20863754 0.52794726
## 19 -0.42435801 -0.48338699 1.15903863
## 20 2.31943098 2.92722331 -2.08611253
## 21 -0.99973763 -0.65223429 0.29555480
## 22 -0.43248096 -0.46743153 -0.08509252
## 23 0.50345189 1.18878008 -0.90128119
## 24 0.79950834 -0.02436078 0.18939933
## 25 -1.11210588 -1.03691859 1.56408123
## 26 -0.98213783 -1.10284443 0.34908475
## 27 -0.23482130 -0.62295340 1.05579183
## 28 -0.84404667 -1.14299607 1.30234528
## 29 -0.32146659 -0.99834496 1.53128622
## 30 -0.86300035 -0.08222123 2.01787200
## 31 -0.35937401 0.44974438 -1.04300226
## 32 0.10092968 0.80076407 -0.95524408
## 33 -0.80343164 -0.78145595 0.40875576
## 34 -0.54078771 0.10047751 -0.66667072
## 35 -0.61931016 -0.27368671 0.24182446
## 36 -0.11568382 0.76517130 -1.24451072
## 37 0.26338981 0.46324498 0.43725866
## 38 -0.85216962 -0.57298301 1.11056729
## 39 0.06302227 0.71923344 -0.95791733
## 40 1.27335038 2.71033430 -1.12882028
## 41 0.19840570 1.37323217 -0.96961443
## 42 0.31933030 2.01600877 -1.11384361
## 43 -0.68970950 -0.17444727 0.09172721
## 44 -0.26731322 0.45342623 -0.54784308
## 45 -0.57961574 -0.04206959 -0.36458895
## 46 2.03962048 0.69240723 -1.11153410
## 47 0.22006716 0.80216710 -0.43664372
## 48 -0.63014089 -0.44919672 -0.50702314
## 49 -0.14276051 0.77516538 -0.75228685
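As a reminder of what scale() does: it subtracts the mean and divides by the standard deviation, so each scaled variable has mean 0 and standard deviation 1. A quick check on toy numbers:

```r
v <- c(3, 7, 8, 12, 20)
s <- as.numeric(scale(v))      # center on the mean, then divide by the sd
c(mean = mean(s), sd = sd(s))  # approximately 0 and exactly 1
```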
Next create adjacency neighbors using queen contiguity.
nbs <- spdep::poly2nb(CC.sf,
queen = TRUE)
plot(CC.sf$geometry)
plot(nbs,
sf::st_centroid(sf::st_geometry(CC.sf)),
add = TRUE)
Next combine the contiguity graph with your scaled attribute data to calculate edge costs based on the distance in attribute space between each pair of neighboring nodes. The function spdep::nbcosts() provides distance methods for Euclidean, Manhattan, Canberra, binary, Minkowski, and Mahalanobis, and defaults to Euclidean if not specified.
costs <- spdep::nbcosts(nbs,
data = CCs.df)
Next transform the edge costs into spatial weights using the spdep::nb2listw() function before constructing the minimum spanning tree with the weights list.
wts <- spdep::nb2listw(nbs,
glist = costs,
style = "B")
mst <- spdep::mstree(wts)
head(mst)
## [,1] [,2] [,3]
## [1,] 12 16 0.4432652
## [2,] 16 25 0.4158649
## [3,] 25 28 0.3893763
## [4,] 16 11 0.4479811
## [5,] 16 15 0.5448893
## [6,] 15 5 0.3936652
Edges with higher dissimilarity are removed, leaving a set of nodes and edges that minimizes the sum of dissimilarities across all edges of the tree (a minimum spanning tree).
The edge connecting node (tract) 33 with node (tract) 35 has a dissimilarity of .56 units. The edge connecting tract 35 with tract 43 has a dissimilarity of .19 units.
Finally, the spdep::skater() function partitions the graph by identifying which edges to remove based on dissimilarity while maximizing the between-group variation. The ncuts = argument specifies the number of partitions to make, resulting in ncuts + 1 groups.
clus5 <- spdep::skater(edges = mst[,1:2],
data = CCs.df,
ncuts = 4)
Where are these groups located?
CC.sf <- CC.sf |>
dplyr::mutate(Group = clus5$groups)
library(ggplot2)
ggplot() +
geom_sf(data = CC.sf,
mapping = aes(fill = factor(Group)))
The map shows five distinct regions based on the three variables of income, housing value, and crime. Importantly the regions are contiguous.
As a comparison, here is the result of grouping the same three variables using hierarchical clustering using the method of minimum variance (Ward) and without regard to spatial contiguity.
dd <- dist(CCs.df)
hc <- hclust(dd,
method = "ward.D")
hcGroup <- cutree(hc, k = 5)
CC.sf <- CC.sf |>
dplyr::mutate(hcGroup = hcGroup)
ggplot() +
geom_sf(data = CC.sf,
mapping = aes(fill = factor(hcGroup)))
Here the map shows five regions but the regions are not contiguous.
More information: https://www.tandfonline.com/doi/abs/10.1080/13658810600665111
Estimating spatial autocorrelation in model residuals
A spatial regression model should be considered for your data whenever the residuals resulting from an aspatial regression exhibit spatial autocorrelation. A common way to proceed is to first regress the response variable onto the explanatory variables and check for autocorrelation in the residuals.
If there is significant spatial autocorrelation in the residuals then a spatial regression model should be considered.
Let’s stay with the Columbus crime data and fit a linear regression model with CRIME as the response variable and INC and HOVAL as the explanatory variables. At the level of tracts, how well do income and housing value statistically explain the amount of crime?
model <- lm(CRIME ~ INC + HOVAL,
data = CC.sf)
summary(model)
##
## Call:
## lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.418 -6.388 -1.580 9.052 28.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6190 4.7355 14.490 < 2e-16 ***
## INC -1.5973 0.3341 -4.780 1.83e-05 ***
## HOVAL -0.2739 0.1032 -2.654 0.0109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.43 on 46 degrees of freedom
## Multiple R-squared: 0.5524, Adjusted R-squared: 0.5329
## F-statistic: 28.39 on 2 and 46 DF, p-value: 9.341e-09
The model statistically explains 55% of the variation in crime as can be seen by the multiple R-squared value. Looking at the coefficients (values under the Estimate column), you see that higher incomes are associated with lower values of crime (negative coefficient) and higher housing values are associated with lower crime. For every one unit increase in income, crime values decrease by 1.6 units.
Use the residuals() method to extract the vector of residuals from the model.
( res <- residuals(model) )
## 1 2 3 4 5 6
## 0.3465419 -3.6947990 -5.2873940 -19.9855151 6.4475490 -9.0734793
## 7 8 9 10 11 12
## -34.4177224 -1.9146840 4.3960594 13.5091017 10.9800573 9.5877236
## 13 14 15 16 17 18
## 4.7728320 16.1128397 0.6675424 3.5491565 -4.6630963 12.8399569
## 19 20 21 22 23 24
## 12.8428644 3.4948724 -6.0537589 -7.8697868 -1.7037730 6.9913819
## 25 26 27 28 29 30
## 11.0984343 -9.1741523 10.8026296 7.1086321 14.9005133 28.6487456
## 31 32 33 34 35 36
## -15.1722792 -8.1776706 -4.3438864 -12.9749799 -1.5798172 -14.4376850
## 37 38 39 40 41 42
## 12.8687861 9.0515532 -9.1569014 12.2449674 -2.7098171 1.3443547
## 43 44 45 46 47 48
## -3.5432909 -6.3880045 -9.4155428 -1.9731210 1.1150296 -15.7632989
## 49
## -6.2476690
There are 49 residuals, one for each tract. The residuals are the difference between the observed crime rates and the predicted crime rates (observed - predicted). A residual greater than 0 indicates that the model under-predicts the observed crime rate in that tract and a residual less than 0 indicates that the model over-predicts the observed crime rate.
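The observed-minus-predicted definition is easy to confirm on any fitted model; a toy example:

```r
df  <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))
fit <- lm(y ~ x, data = df)
all.equal(unname(residuals(fit)),
          df$y - unname(fitted(fit)))  # residual = observed - predicted
```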
If you plot the residuals, they should be approximated by a normal distribution. You check this with the sm::sm.density() function with the first argument the vector of residuals (res) and the argument model = set to “Normal”.
sm::sm.density(res,
model = "Normal")
The density curve of the residuals (black line) fits completely within the blue ribbon that defines a normal distribution.
Next create a map of the model residuals. Do the residuals show any pattern of clustering? Since the values in the vector of residuals res are arranged in the same order as the rows in the simple feature data frame you create a new column in the data frame using the $ syntax and calling the new column res.
CC.sf$res <- res
tmap::tm_shape(CC.sf) +
tmap::tm_fill(col = "res") +
tmap::tm_borders(col = "gray70") +
tmap::tm_layout(title = "Linear model residuals")
## Warning: Currect projection of shape CC.sf unknown. Long-lat (WGS84) is assumed.
## Variable(s) "res" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

The map shows contiguous tracts with negative residuals across the southwestern and southern part of the city and a group of contiguous tracts with positive residuals toward the center.
The map indicates some clustering but the clustering appears to be less than with the crime values themselves. That is, after accounting for regional factors related to crime, the autocorrelation is reduced.
To determine the amount of autocorrelation in the residuals use the spdep::lm.morantest() function, passing the regression model object and the weights object to it. Note that you once again use the default neighborhood and weighting schemes.
nbs <- spdep::poly2nb(CC.sf)
wts <- spdep::nb2listw(nbs)
spdep::lm.morantest(model,
listw = wts)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
## weights: wts
##
## Moran I statistic standard deviate = 2.8393, p-value = 0.00226
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.222109407 -0.033418335 0.008099305
Moran’s I on the model residuals is .22. This compares with the value of .5 for the crime values themselves. Part of the autocorrelation in the crime rates is statistically ‘absorbed’ by the explanatory factors.
But does this output let you know if you need a spatial regression model?
The \(p\)-value on I is .002, so you reject the null hypothesis of no spatial autocorrelation in the model residuals and conclude that a spatial regression model would improve the fit. The \(z\)-value (as the basis for the \(p\)-value) takes into account the fact that these are residuals from a model, so the variance is adjusted accordingly.
Given significant spatial autocorrelation in the model residuals, the next step is to choose the type of spatial regression model.
Choosing a spatial regression model
Ordinary least-squares regression models fit to spatial data can lead to improper inference because observations are not independent. This might lead to poor policy decisions. Thus it’s necessary to check the residuals from an ordinary least-squares model for autocorrelation. If the residuals are strongly correlated the model is not specified properly.
You can try to improve the model by adding variables. If that’s not possible (no additional data, or no clue as to what variable to include), you can try a spatial regression model. Spatial regression models are widely used in econometrics and epidemiology.
The equation for a regression model in vector notation is \[ y = X \beta + \varepsilon \] where \(y\) is a \(n\) by 1 vector of response variable values, \(X\) is a \(n\) by \(p+1\) matrix containing the explanatory variables and augmented by a column of ones for the intercept term, \(\beta\) is a \(p+1\) \(\times\) 1 vector of model coefficients and \(\varepsilon\) is a \(n\) by 1 vector of residuals (iid).
A couple options exist if the elements of the vector \(\varepsilon\) are correlated. One is to include a spatial lag term so the model becomes \[ y = \rho W y + X \beta + \varepsilon \] where \(Wy\) is the weighted average of the neighborhood response values (spatial lag variable) with \(W\) the spatial weights matrix, and \(\rho\) is the autoregression coefficient. This is called a spatial autoregressive (SAR) model.
Note: \(Wy\) is the spatial lag variable you compute with the spdep::lag.listw() function, and \(\rho\) is the coefficient on it (analogous to, though not identical to, Moran’s I). Thus the model is also called a spatial lag model (SLM).
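As a concrete sketch of the lag computation, consider a toy base-R example with a hypothetical four-region chain and a row-standardized weights matrix (the numbers here are made up, not the Columbus data):

```r
# The spatial lag Wy replaces each value with the weighted average of its
# neighbors' values. Row-standardized weights for a chain of 4 regions.
W <- matrix(c( 0,  1,  0,  0,
              .5,  0, .5,  0,
               0, .5,  0, .5,
               0,  0,  1,  0),
            nrow = 4, byrow = TRUE)
y <- c(10, 20, 30, 40)
Wy <- as.vector(W %*% y)  # same idea as spdep::lag.listw(wts, y)
Wy
# Region 2's lag is (10 + 30) / 2 = 20
```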
Justification for the spatial lag model is domain specific but motivated by a ‘diffusion’ process. The response variable \(y_i\) is influenced both by the explanatory variables at location \(i\) and, through the lag term, by the response values at neighboring locations \(j\).
\(\rho Wy\) is called the spatial signal term and \(X\beta\) is called the trend term.
Another option is to include a spatial error term so the model becomes \[ y = X\beta + \lambda W \epsilon + u \] where \(\lambda\) is the autoregression coefficient, \(W\epsilon\) is the spatial error term representing the weighted average of the neighborhood residuals, and \(u\) are the overall residuals assumed to be iid. This is called a spatial error model (SEM).
Here the lag term is computed using the residuals rather than the response variable.
Application of the spatial error model is motivated by the omitted variable bias. Suppose the variable \(y\) is statistically described by two variables \(x\) and \(z\) each centered on zero and independent. Then \[ y = \beta x + \theta z \]
If \(z\) is not observed, then the vector \(\theta z\) is nested in the error term \(\epsilon\). \[ y = \beta x + \epsilon \]
Examples of an unobserved latent variable \(z\) include local culture, social capital, neighborhood readiness. Importantly you would expect the latent variable to be spatially correlated (e.g., culture will be similar across neighborhoods), so let \[ z = \lambda W z + r\\ z = (I - \lambda W)^{-1} r \] where \(r\) is a vector of random independent residuals (e.g., culture is similar but not identical), \(W\) is the spatial weights matrix and \(\lambda\) is a scalar spatial correlation parameter. Substituting into the equation above \[ y = \beta x + \theta z \\ y = \beta x + \theta (I - \lambda W)^{-1} r\\ y = \beta x + (I - \lambda W)^{-1} \varepsilon \] where \(\varepsilon = \theta r\).
Another motivation for considering a spatial error model is heterogeneity. Suppose you have multiple observations for each unit. If you want a model that incorporates individual effects you can include a \(n \times 1\) vector \(a\) of individual intercepts for each unit. \[ y = a + X\beta \] where now \(X\) is a \(n\) \(\times\) \(p\) matrix.
In a cross-sectional setting with one observation per unit (typically the case in observational studies), this approach is not possible since you will have more parameters than observations.
Instead you can treat \(a\) as a vector of spatial random effects. You assume that the intercepts follow a spatially smoothed process \[ a = \lambda W a + \epsilon \\ a = (I - \lambda W)^{-1} \epsilon \] which leads to the previous model \[ y = X\beta + (I - \lambda W)^{-1} \epsilon \]
In the absence of domain-specific knowledge of the process that might be responsible for the autocorrelated residuals, you can run some statistical tests on the linear model.
The tests are performed with the spdep::lm.LMtests() function. The LM stands for ‘Lagrange multiplier’ indicating that the technique simultaneously determines the coefficients on the explanatory variables AND the coefficient on the spatial lag variable.
The test type is specified as a character string. The tests should be considered in a sequence starting with the standard versions and moving to the ‘robust’ versions if the choice remains ambiguous.
To perform LM tests you specify the model object, the weights matrix, and the two model types using the test = argument. The model types are specified as character strings "LMerr" and "LMlag" for the spatial error and lag models, respectively.
spdep::lm.LMtests(model,
listw = wts,
test = c("LMerr", "LMlag"))##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
## weights: wts
##
## LMerr = 5.2062, df = 1, p-value = 0.02251
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
## weights: wts
##
## LMlag = 8.898, df = 1, p-value = 0.002855
The output shows that both the spatial error and spatial lag models are significant (\(p\)-value < .15). Ideally one model is significant and the other is not, and you choose the model that is significant.
Since both are significant, you test again. This time you use the robust forms of the statistics with character strings "RLMerr" and "RLMlag" in the test = argument.
spdep::lm.LMtests(model,
listw = wts,
test = c("RLMerr", "RLMlag"))##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
## weights: wts
##
## RLMerr = 0.043906, df = 1, p-value = 0.834
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = CC.sf)
## weights: wts
##
## RLMlag = 3.7357, df = 1, p-value = 0.05326
Here the error model has a large \(p\)-value and the lag model has a \(p\)-value that is less than .15 so you choose the lag model for your spatial regression.
A decision tree (from Luc Anselin) shows the sequence of tests for making a choice about which type of spatial model to use.
If both tests are significant, then you should fit both models and check which one results in the lower information criterion (AIC).
Another option is to include both a spatial lag term and a spatial error term in a single model.
Thursday October 6, 2022
“Feeling a little uncomfortable with your skills is a sign of learning, and continuous learning is what the tech industry thrives on!” — Vanessa Hurst
Today
- Fitting and interpreting a spatially-lagged Y model
- Fitting and interpreting a spatially-lagged X model
- Fitting and interpreting spatial Durbin models
Ordinary least-squares regression models fit to spatial data can lead to improper inference because observations are not independent. This might lead to poor policy decisions. Thus it’s necessary to check the residuals from an aspatial model for autocorrelation. If the residuals are strongly correlated the model is not specified properly.
Fitting and interpreting a spatially-lagged Y model
Continuing with the Columbus crime data.
( CC.sf <- sf::st_read(dsn = here::here("data", "columbus"),
layer = "columbus") )## Reading layer `columbus' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/columbus'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## First 10 features:
## AREA PERIMETER COLUMBUS_ COLUMBUS_I POLYID NEIG HOVAL INC CRIME
## 1 0.309441 2.440629 2 5 1 5 80.467 19.531 15.725980
## 2 0.259329 2.236939 3 1 2 1 44.567 21.232 18.801754
## 3 0.192468 2.187547 4 6 3 6 26.350 15.956 30.626781
## 4 0.083841 1.427635 5 2 4 2 33.200 4.477 32.387760
## 5 0.488888 2.997133 6 7 5 7 23.225 11.252 50.731510
## 6 0.283079 2.335634 7 8 6 8 28.750 16.029 26.066658
## 7 0.257084 2.554577 8 4 7 4 75.000 8.438 0.178269
## 8 0.204954 2.139524 9 3 8 3 37.125 11.337 38.425858
## 9 0.500755 3.169707 10 18 9 18 52.600 17.586 30.515917
## 10 0.246689 2.087235 11 10 10 10 96.400 13.598 34.000835
## OPEN PLUMB DISCBD X Y NSA NSB EW CP THOUS NEIGNO
## 1 2.850747 0.217155 5.03 38.80 44.07 1 1 1 0 1000 1005
## 2 5.296720 0.320581 4.27 35.62 42.38 1 1 0 0 1000 1001
## 3 4.534649 0.374404 3.89 39.82 41.18 1 1 1 0 1000 1006
## 4 0.394427 1.186944 3.70 36.50 40.52 1 1 0 0 1000 1002
## 5 0.405664 0.624596 2.83 40.01 38.00 1 1 1 0 1000 1007
## 6 0.563075 0.254130 3.78 43.75 39.28 1 1 1 0 1000 1008
## 7 0.000000 2.402402 2.74 33.36 38.41 1 1 0 0 1000 1004
## 8 3.483478 2.739726 2.89 36.71 38.71 1 1 0 0 1000 1003
## 9 0.527488 0.890736 3.17 43.44 35.92 1 1 1 0 1000 1018
## 10 1.548348 0.557724 4.33 47.61 36.42 1 1 1 0 1000 1010
## geometry
## 1 POLYGON ((8.624129 14.23698...
## 2 POLYGON ((8.25279 14.23694,...
## 3 POLYGON ((8.653305 14.00809...
## 4 POLYGON ((8.459499 13.82035...
## 5 POLYGON ((8.685274 13.63952...
## 6 POLYGON ((9.401384 13.5504,...
## 7 POLYGON ((8.037741 13.60752...
## 8 POLYGON ((8.247527 13.58651...
## 9 POLYGON ((9.333297 13.27242...
## 10 POLYGON ((10.08251 13.03377...
Recall that interest with these data centers on crime as the response variable and income and housing value as explanatory variables. You set the model formula and refit an OLS regression model.
f <- CRIME ~ INC + HOVAL
( model.ols <- lm(f, data = CC.sf) )##
## Call:
## lm(formula = f, data = CC.sf)
##
## Coefficients:
## (Intercept) INC HOVAL
## 68.6190 -1.5973 -0.2739
The marginal effect of income on crime is -1.6 and the marginal effect of housing value on crime is -.27.
A nice way to visualize the relative significance of the explanatory variables is to make a plot. Here you use the broom::tidy() method and then ggplot() as follows.
if(!require(broom)) install.packages(pkgs = "broom", repos = "http://cran.us.r-project.org")## Loading required package: broom
library(broom)
( d <- broom::tidy(model.ols,
conf.int = TRUE) )## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 68.6 4.74 14.5 9.21e-19 59.1 78.2
## 2 INC -1.60 0.334 -4.78 1.83e- 5 -2.27 -0.925
## 3 HOVAL -0.274 0.103 -2.65 1.09e- 2 -0.482 -0.0662
library(ggplot2)
ggplot(d[-1,], aes(x = estimate, # we do not plot the intercept term
y = term,
xmin = conf.low,
xmax = conf.high,
height = 0)) +
geom_point(size = 2) +
geom_vline(xintercept = 0, lty = 4) +
geom_errorbarh()
The maximum likelihood estimate is shown as a point and the confidence interval around the estimate is shown as a horizontal error bar. The default confidence level is 95% (conf.level = .95). The effects are statistically significant as the confidence intervals do not intersect the zero line (dashed-dotted).
Then check for spatial autocorrelation in the residuals. This is done by first defining the weights matrix and then applying Moran’s I test as follows.
nbs <- spdep::poly2nb(CC.sf,
queen = TRUE)
wts <- spdep::nb2listw(nbs)
spdep::lm.morantest(model.ols,
listw = wts)##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = f, data = CC.sf)
## weights: wts
##
## Moran I statistic standard deviate = 2.8393, p-value = 0.00226
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 0.222109407 -0.033418335 0.008099305
The results show that the model residuals have significant spatial autocorrelation so reporting the marginal effects with an OLS regression model would not be correct.
You then fit a spatially-lagged Y model using the lagsarlm() function from the {spatialreg} package. The model is
\[ y = \rho W y + X \beta + \varepsilon \] where \(Wy\) is the weighted average of the neighborhood response values (spatial lag variable) with \(W\) the spatial weights matrix, and \(\rho\) is the autoregression coefficient.
The spatialreg::lagsarlm() function first determines a value for \(\rho\) (with the internal optimize() function) and then the \(\beta\)’s are obtained using generalized least squares (GLS). The model formula f is the same as what you used to fit the OLS regression above. You save the model object as model.slym.
if(!require(spatialreg)) install.packages(pkgs = "spatialreg", repos = "http://cran.us.r-project.org")## Loading required package: spatialreg
## Loading required package: Matrix
##
## Attaching package: 'spatialreg'
## The following objects are masked from 'package:spdep':
##
## get.ClusterOption, get.coresOption, get.mcOption,
## get.VerboseOption, get.ZeroPolicyOption, set.ClusterOption,
## set.coresOption, set.mcOption, set.VerboseOption,
## set.ZeroPolicyOption
model.slym <- spatialreg::lagsarlm(formula = f,
data = CC.sf,
listw = wts)
summary(model.slym)##
## Call:spatialreg::lagsarlm(formula = f, data = CC.sf, listw = wts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.652017 -5.334611 0.071473 6.107196 23.302618
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 45.603250 7.257404 6.2837 3.306e-10
## INC -1.048728 0.307406 -3.4115 0.000646
## HOVAL -0.266335 0.089096 -2.9893 0.002796
##
## Rho: 0.42333, LR test value: 9.4065, p-value: 0.0021621
## Asymptotic standard error: 0.11951
## z-value: 3.5422, p-value: 0.00039686
## Wald statistic: 12.547, p-value: 0.00039686
##
## Log likelihood: -182.674 for lag model
## ML residual variance (sigma squared): 96.857, (sigma: 9.8416)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 375.35, (AIC for lm: 382.75)
## LM test for residual autocorrelation
## test value: 0.24703, p-value: 0.61917
The first batch of output concerns the model residuals and the coefficients on the explanatory variables. The model residuals are the observed crime rates minus the predicted crime rates.
The coefficients on income and housing have the same sign (negative) and they remain statistically significant (-1.05 for income and -.27 for housing value). But you can’t interpret these coefficients as the marginal effects.
The next set of output is about the coefficient of spatial autocorrelation (\(\rho\)). The value is .423 and a likelihood ratio test gives a value of 9.41 which translates to a \(p\)-value of .002. The null hypothesis is the autocorrelation is zero, so you confidently reject it. This is consistent with the significant Moran’s I value that you found in the linear model residuals.
Two other tests are performed on the value of \(\rho\): a z-test (t-test) using the asymptotic standard error and a Wald test. Both tests confirm that the lag term should be included in the model of crime involving income and housing values.
In spatial models that contain a lagged response term, the coefficients are not marginal effects. The spatial lag model allows for ‘spillover’. That is, a change in an explanatory variable anywhere in the study domain will affect the value of the response variable everywhere. Spillover occurs even when the neighborhood weights matrix represents local contiguity. The spillover makes interpreting the coefficients more complicated.
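The spillover can be seen from the reduced form of the model. Solving \(y = \rho W y + X \beta + \varepsilon\) for \(y\) gives \[ y = (I - \rho W)^{-1} X \beta + (I - \rho W)^{-1} \varepsilon \] Because \((I - \rho W)^{-1} = I + \rho W + \rho^2 W^2 + \cdots\) (for \(|\rho| < 1\) with row-standardized \(W\)), the effect of a change in \(x_{jk}\) on \(y_i\) is \([(I - \rho W)^{-1}]_{ij} \beta_k\), which is generally nonzero even when tracts \(i\) and \(j\) are not neighbors.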
With a spatially-lagged Y model a change in the value of an explanatory variable results in both direct and indirect effects on the response variable.
For example, the direct effect gives the impact a change in income has on crime averaged over all tracts. It includes feedback effects, whereby a change in the \(i\)th tract’s income affects crime in neighboring tracts, which in turn feeds back on crime in the \(i\)th tract.
The indirect effect gives the impact a change in income has on crime averaged over all other tracts. Indirect effects represent spillovers: influences on the response variable \(y\) in one region arising from a change in \(x\) in some other region. For example, if all tracts \(j \ne i\) increase their income, what will be the impact on crime in tract \(i\)?
The total effect (TE) is the sum of the direct and indirect effects. It measures the total cumulative impact on crime arising from one tract \(j\) increasing its income over all other tracts (on average). It is given by
\[ \hbox{TE} = \left(\frac{\beta_k}{1-\rho^2}\right)\left(1 + \rho\right) \] where \(\beta_k\) is the marginal effect of variable \(k\) and \(\rho\) is the spatial autocorrelation coefficient. With \(\rho = 0\) TE is \(\beta_k\).
Here \(\beta_{INC}\) is -1.0487 and \(\rho\) is .4233, so the total effect is
( TE_INC <- -1.0487 / (1 - .4233^2) * (1 + .4233) )## [1] -1.81845
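Because \(1 - \rho^2 = (1 - \rho)(1 + \rho)\), the TE formula simplifies to \[ \hbox{TE} = \frac{\beta_k}{1 - \rho} \] giving the same value: \(-1.0487 / (1 - .4233) \approx -1.818\).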
The direct, indirect, and total effects are shown using the spatialreg::impacts() function.
spatialreg::impacts(model.slym,
listw = wts)## Impact measures (lag, exact):
## Direct Indirect Total
## INC -1.1008955 -0.7176833 -1.8185788
## HOVAL -0.2795832 -0.1822627 -0.4618459
The direct effects are the changes in the response variable of a particular region arising from a one unit increase in an explanatory variable in that region.
The indirect effects are the changes in the response variable of a particular region arising from a one unit increase in an explanatory variable in another region. For example, due to spatial autocorrelation, a one-unit change in the income variable in region 1 affects the crime rate in regions 2 and 3.
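These impact measures can be sketched in a few lines of base R. A minimal illustration with a hypothetical four-region weights matrix and made-up values of \(\rho\) and \(\beta_k\) (not the Columbus estimates): the matrix of partial derivatives is \(S = (I - \rho W)^{-1} \beta_k\), the average direct effect is the mean of its diagonal, the average total effect is the mean of its row sums, and the indirect effect is the difference.

```r
# Sketch of the impact measures for a spatial lag model (toy numbers).
W <- matrix(c( 0,  1,  0,  0,
              .5,  0, .5,  0,
               0, .5,  0, .5,
               0,  0,  1,  0),
            nrow = 4, byrow = TRUE)     # row-standardized contiguity (chain)
rho <- .4                               # hypothetical autoregression coefficient
beta_k <- -1                            # hypothetical coefficient on variable k
S <- solve(diag(4) - rho * W) * beta_k  # partial derivatives dy_i / dx_jk
direct <- mean(diag(S))                 # average own-region impact
total <- mean(rowSums(S))               # average cumulative impact
indirect <- total - direct              # average spillover impact
c(direct = direct, indirect = indirect, total = total)
# With row-standardized W, the total effect equals beta_k / (1 - rho)
```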
The next set of output concerns the overall model fit. It includes the log likelihood value and the AIC (Akaike Information Criterion). The AIC value for the linear model is included. Here it is clear that the spatial lag model is an improvement (smaller AIC) over the aspatial model.
The larger the likelihood, the better the model. Two times the difference in log likelihoods from two competing models, divided by the number of observations, gives a scale for how much improvement.
x <- 2 * (logLik(model.slym) - logLik(model.ols))/49
x[1]## [1] 0.1919701
Improvement table
| Likelihood difference | Qualitative improvement |
|---|---|
| 1 | huge |
| .1 | large |
| .01 | good |
| .001 | okay |
The final bit of output is a Lagrange multiplier test for remaining autocorrelation. The null hypothesis is there is no remaining autocorrelation since we have a lag term in the model. The result is a high \(p\)-value so you are satisfied that the lag term takes care of the autocorrelation.
Compare the spatial lag model to a spatial error model. Here you use the spatialreg::errorsarlm() function.
model.sem <- spatialreg::errorsarlm(formula = f,
data = CC.sf,
listw = wts)
summary(model.sem)##
## Call:spatialreg::errorsarlm(formula = f, data = CC.sf, listw = wts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.65998 -6.16943 -0.70623 7.75392 23.43878
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 60.279469 5.365594 11.2344 < 2.2e-16
## INC -0.957305 0.334231 -2.8642 0.0041806
## HOVAL -0.304559 0.092047 -3.3087 0.0009372
##
## Lambda: 0.54675, LR test value: 7.2556, p-value: 0.0070679
## Asymptotic standard error: 0.13805
## z-value: 3.9605, p-value: 7.4786e-05
## Wald statistic: 15.686, p-value: 7.4786e-05
##
## Log likelihood: -183.7494 for error model
## ML residual variance (sigma squared): 97.674, (sigma: 9.883)
## Number of observations: 49
## Number of parameters estimated: 5
## AIC: 377.5, (AIC for lm: 382.75)
You find the coefficient of spatial autocorrelation (\(\lambda\)) is significant, but the log likelihood value from the model is smaller (-183.7) and the AIC value is larger (377.5) compared with the corresponding values from the lag model. This is consistent with the Lagrange multiplier (LM) tests indicating the spatial lag model is more appropriate.
Also you can compare the log likelihoods from the two spatial regression models that you fit.
x <- 2 * (logLik(model.slym) - logLik(model.sem))/49
x[1]## [1] 0.04389617
With a value of .04 you conclude that there is good improvement of the lag model over the error model. Again, this is consistent with your decision above to use the lag model.
With the spatial error model the coefficients can be interpreted as marginal effects like with the OLS model.
If there are large differences (e.g., different signs) between the coefficient estimates from the SEM and OLS models, this suggests that neither model is yielding parameter estimates matching the underlying parameters of the data generating process.
You test whether there is a significant difference in coefficient estimates with the Hausman test under the hypothesis of no difference.
spatialreg::Hausman.test(model.sem)##
## Spatial Hausman test (asymptotic)
##
## data: NULL
## Hausman test = 5.6132, df = 3, p-value = 0.132
The \(p\)-value of .132 gives weak, inconclusive evidence that the coefficients differ; if they do differ, that would suggest the SEM is not the right way to proceed with these data.
The predict() method dispatches to the predict.sarlm() function to calculate predictions from the spatial regression model. The prediction from a spatially-lagged Y model is decomposed into a “trend” term (explanatory variable effect) and a “signal” term (spatial smoother). The predicted fit is the sum of the trend and the signal terms.
You make predictions with the predict() method under the assumption that the mean response is known. You examine the structure of the corresponding predict object.
( predictedValues <- predict(model.slym) )## This method assumes the response is known - see manual page
## fit trend signal
## 1 14.151553 3.689376 10.462177
## 2 22.577864 11.466910 11.110954
## 3 34.302562 21.851821 12.450741
## 4 46.732511 32.065778 14.666733
## 5 44.747335 27.617335 17.130001
## 6 38.333111 21.136061 17.197049
## 7 37.830286 16.778971 21.051314
## 8 41.393775 23.826139 17.567636
## 9 28.792040 13.151106 15.640934
## 10 16.390116 5.667968 10.722148
## 11 53.631524 32.525601 21.105923
## 12 48.074429 29.765567 18.308862
## 13 40.608933 24.482783 16.126150
## 14 41.856029 23.729007 18.127021
## 15 51.665885 30.455130 21.210754
## 16 54.767238 32.599604 22.167634
## 17 31.866732 24.208333 7.658399
## 18 37.461969 15.795681 21.666289
## 19 44.929428 25.269281 19.660147
## 20 5.110404 -8.624965 13.735369
## 21 47.617356 29.109014 18.508343
## 22 40.412907 25.213797 15.199111
## 23 18.640125 10.704444 7.935681
## 24 39.747460 16.504544 23.242917
## 25 53.116667 31.962568 21.154099
## 26 52.303708 31.717686 20.586022
## 27 39.228078 25.171897 14.056180
## 28 51.354572 31.278691 20.075881
## 29 49.767662 27.843360 21.924302
## 30 45.589426 25.027103 20.562323
## 31 27.465214 19.368347 8.096867
## 32 20.869990 15.004949 5.865041
## 33 44.697299 28.916463 15.780837
## 34 31.720868 22.349636 9.371232
## 35 38.985264 24.973807 14.011457
## 36 24.222607 16.283179 7.939428
## 37 37.811893 16.224746 21.587148
## 38 46.388525 27.909226 18.479299
## 39 22.524680 15.679043 6.845638
## 40 6.730001 -2.182900 8.912900
## 41 20.020878 11.101447 8.919431
## 42 14.764446 6.662086 8.102360
## 43 40.034408 24.726462 15.307946
## 44 34.026283 18.893555 15.132728
## 45 36.970894 23.393214 13.577680
## 46 13.189170 6.118277 7.070893
## 47 21.849812 14.410621 7.439191
## 48 38.162353 26.076852 12.085501
## 49 27.876102 16.356569 11.519533
The predicted values are in the column labeled fit. The predicted values are a sum of the trend term (\(X\beta\)) and the signal term (\(\rho W y\)). The signal term is called the spatial smoother.
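A toy illustration of this decomposition, with made-up numbers for three regions: the trend is \(X\beta\), the signal is \(\rho W y\), and the fitted value is their sum.

```r
# Decompose a spatial-lag prediction into trend (X beta) and signal (rho W y).
W <- rbind(c( 0, 1,  0),      # row-standardized weights for a 3-region chain
           c(.5, 0, .5),
           c( 0, 1,  0))
y <- c(1, 2, 3)               # observed response (assumed known, as in predict())
trend <- c(.5, 1, 1.5)        # hypothetical X beta values
rho <- .3
signal <- rho * as.vector(W %*% y)  # spatial smoother
fit <- trend + signal
fit
```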
As a first-order check that things are what you think they are, compare the first five predicted values with the corresponding observed values.
predictedValues[1:5]## 1 2 3 4 5
## 14.15155 22.57786 34.30256 46.73251 44.74734
CC.sf$CRIME[1:5]## [1] 15.72598 18.80175 30.62678 32.38776 50.73151
Some predicted values are lower than the corresponding observed values and some are higher.
The predicted values along with the values for the trend and signal are added to the simple features data frame.
CC.sf$fit <- as.numeric(predictedValues)
CC.sf$trend <- attr(predictedValues, "trend")
CC.sf$signal <- attr(predictedValues, "signal")
The components of the predictions are mapped and placed on the same plot.
library(ggplot2)
( g1 <- ggplot() +
geom_sf(data = CC.sf, aes(fill = fit)) +
scale_fill_viridis_c() +
ggtitle("Predicted Crime") )
( g2 <- ggplot() +
geom_sf(data = CC.sf, aes(fill = trend)) +
scale_fill_viridis_c() +
ggtitle("Trend (Explanatory Variables)") )
( g3 <- ggplot() +
geom_sf(data = CC.sf, aes(fill = signal)) +
scale_fill_viridis_c() +
ggtitle("Signal") )
library(patchwork)
g1 + g2 + g3
The trend term and the spatial smoother have similar ranges indicating nearly equal contributions to the predictions. The largest difference between the two terms occurs in the city’s east side.
A map of the difference makes this clear.
CC.sf <- CC.sf |>
dplyr::mutate(CovMinusSmooth = trend - signal)
tmap::tm_shape(CC.sf) +
tmap::tm_fill(col = "CovMinusSmooth")## Warning: Currect projection of shape CC.sf unknown. Long-lat (WGS84) is assumed.
## Variable(s) "CovMinusSmooth" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
How many tracts have a smaller residual with the lag model versus the OLS model?
CC.sf |>
dplyr::mutate(residualsL = CRIME - fit,
lagWins = abs(residuals(model.ols)) > abs(residualsL),
CovMinusSmooth = trend - signal) |>
sf::st_drop_geometry() |>
dplyr::summarize(N = sum(lagWins))## N
## 1 32
In 32 out of the 49 tracts the residuals from the spatial model are smaller than the residuals from the OLS model.
Fitting and interpreting a spatially-lagged X model
Another spatial regression option is to modify the linear model to include spatially-lagged explanatory variables. This is called the spatially-lagged X model. \[ y = X \beta + WX \theta + \varepsilon \]
In this case the matrix of explanatory variables is premultiplied by the weights matrix \(W\), and \(\theta\) is a vector of coefficients, one for each lagged explanatory variable.
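Since the lagged X variables are just additional columns, the SLX model can be fit with ordinary lm() once \(WX\) is computed. A self-contained sketch with simulated data on a hypothetical ring of 100 units (all names and coefficient values are made up for illustration):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
# Lag of x on a ring: each unit averages its two neighbors (a stand-in for WX).
Wx <- (c(x[n], x[-n]) + c(x[-1], x[1])) / 2
y <- 2 - 1.1 * x - 0.5 * Wx + rnorm(n, sd = .2)
fit <- lm(y ~ x + Wx)  # spatialreg::lmSLX() builds this regression internally
coef(fit)              # estimates should be near 2, -1.1, and -0.5
```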
Here you fit the spatially-lagged X model using the spatialreg::lmSLX() function and save the model object as model.slxm.
( model.slxm <- spatialreg::lmSLX(formula = f,
data = CC.sf,
listw = wts) )##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Coefficients:
## (Intercept) INC HOVAL lag.INC lag.HOVAL
## 74.5534 -1.0974 -0.2944 -1.3987 0.2148
With this model, besides the direct marginal effects of income and housing value on crime, you also have the spatially-lagged indirect effects.
The total effect of income on crime is the sum of the direct effect and indirect effect. And again, using the spatialreg::impacts() function you see this.
spatialreg::impacts(model.slxm, listw = wts)## Impact measures (SlX, estimable):
## Direct Indirect Total
## INC -1.0973898 -1.398746 -2.49613551
## HOVAL -0.2943898 0.214841 -0.07954881
You get the impact measures and their standard errors, z-values and \(p\)-values with the summary() method applied to the output of the impacts() function.
summary(spatialreg::impacts(model.slxm, listw = wts))## Impact measures (SlX, estimable, n-k):
## Direct Indirect Total
## INC -1.0973898 -1.398746 -2.49613551
## HOVAL -0.2943898 0.214841 -0.07954881
## ========================================================
## Standard errors:
## Direct Indirect Total
## INC 0.3738313 0.5601247 0.4929713
## HOVAL 0.1016586 0.2079212 0.2074767
## ========================================================
## Z-values:
## Direct Indirect Total
## INC -2.935522 -2.497204 -5.0634496
## HOVAL -2.895867 1.033281 -0.3834108
##
## p-values:
## Direct Indirect Total
## INC 0.0033299 0.012518 4.1174e-07
## HOVAL 0.0037811 0.301473 0.70142
Results show that income has a significant direct and indirect effect on crime rates, but housing values only show a significant direct effect and not a significant indirect effect.
Again you visualize the relative significance of the effects.
model.slxm |>
broom::tidy(conf.int = TRUE) |>
dplyr::slice(-1) |>
ggplot(aes(x = estimate,
y = term,
xmin = conf.low,
xmax = conf.high,
height = 0)) +
geom_point(size = 2) +
geom_vline(xintercept = 0, lty = 4) +
geom_errorbarh()## Warning: The `tidy()` method for objects of class `SlX` is not maintained by the broom team, and is only supported through the `lm` tidier method. Please be cautious in interpreting and reporting broom output.
##
## This warning is displayed once per session.
Compare R squared values between the OLS model and the spatially-lagged X model.
summary(model.ols)$r.squared## [1] 0.552404
summary(model.slxm)$r.squared## [1] 0.6105076
The spatially lagged model has an R squared value that is higher than the R squared value from the linear regression.
Fitting and interpreting spatial Durbin models
A workflow for finding the correct spatial model is to consider both the spatial Durbin error model and the spatial Durbin model.
The spatial Durbin error model (SDEM) is a spatial error model with a spatially-lagged X term added. To fit a SDEM use the spatialreg::errorsarlm() function but include the argument etype = "emixed" to ensure that the spatially lagged X variables are added and the lagged intercept term is dropped when the weights style is row standardized ("W").
( model.sdem <- spatialreg::errorsarlm(formula = f,
data = CC.sf,
listw = wts,
etype = "emixed") )##
## Call:
## spatialreg::errorsarlm(formula = f, data = CC.sf, listw = wts,
## etype = "emixed")
## Type: error
##
## Coefficients:
## lambda (Intercept) INC HOVAL lag.INC lag.HOVAL
## 0.4035821 73.6450826 -1.0522585 -0.2781741 -1.2048761 0.1312451
##
## Log likelihood: -181.779
The spatial Durbin model (SDM) is a spatially-lagged Y model with a spatially-lagged X term added to it. To fit a SDM use the lagsarlm() function but include the argument type = "mixed" to ensure that the spatially lagged X variables are added and the lagged intercept term is dropped when the weights style is row standardized ("W").
( model.sdm <- spatialreg::lagsarlm(formula = f,
data = CC.sf,
listw = wts,
type = "mixed") )##
## Call:
## spatialreg::lagsarlm(formula = f, data = CC.sf, listw = wts,
## type = "mixed")
## Type: mixed
##
## Coefficients:
## rho (Intercept) INC HOVAL lag.INC lag.HOVAL
## 0.4034626 44.3200052 -0.9199061 -0.2971294 -0.5839133 0.2576843
##
## Log likelihood: -181.6393
How do you choose between these two models? Is the relationship between crime and income and housing values a global or local effect? Is there any reason to think that if something happens in one tract it will spill over across the entire city? If crime happens in one tract does it influence crime across the entire city? If so, then it is a global relationship. Or should it be a more local effect? If there is more crime in one tract then maybe that influences crime in the neighboring tract but not tracts farther away. If so, then it is a local relationship.
If you think it is a local relationship, start with the spatial Durbin error model and look at the \(p\)-values on the direct and indirect effects.
summary(spatialreg::impacts(model.sdem,
listw = wts,
R = 500), zstats = TRUE)## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## INC -1.0522585 -1.2048761 -2.257135
## HOVAL -0.2781741 0.1312451 -0.146929
## ========================================================
## Standard errors:
## Direct Indirect Total
## INC 0.32127932 0.5736416 0.6326029
## HOVAL 0.09114185 0.2072449 0.2372854
## ========================================================
## Z-values:
## Direct Indirect Total
## INC -3.275214 -2.100399 -3.568012
## HOVAL -3.052101 0.633285 -0.619208
##
## p-values:
## Direct Indirect Total
## INC 0.0010558 0.035694 0.0003597
## HOVAL 0.0022725 0.526548 0.5357794
You see that income has a statistically significant direct and indirect effect on crime. This means that tracts with higher income have lower crime and tracts whose neighboring tracts have higher income also have lower crime.
On the other hand, housing values have only a statistically significant direct effect on crime. Tracts with more expensive houses have lower crime but tracts whose neighboring tracts have more expensive houses do not imply lower crime. And the total effect of housing values on crime across the city is not significant. So if housing values go up in tracts citywide, there is no statistical evidence that crime will go down (or up).
Try a likelihood ratio test, where the null hypothesis is that the restricted model (here the spatially lagged X model) is adequate.
spatialreg::LR.Sarlm(model.sdem,
model.slxm)
##
## Likelihood ratio for spatial linear models
##
## data:
## Likelihood ratio = 4.3832, df = 1, p-value = 0.03629
## sample estimates:
## Log likelihood of model.sdem Log likelihood of model.slxm
## -181.7790 -183.9706
The relatively small \(p\)-value suggests you shouldn’t restrict the spatial Durbin error model to the spatially lagged X model, although the evidence is not overwhelming.
More information
Tuesday October 11, 2022
“We build our computer systems the way we build our cities; over time, without a plan, on top of ruins.” – Ellen Ullman
Today
- Fitting and interpreting geographic regressions
- Mapping incidence and risk with spatial regression models
Fitting and interpreting geographic regressions
Another approach to modeling spatial data is to assume that the relationships between the response variable and the explanatory variables are modified by contextual factors that depend on location. In this case you fit a separate regression model at each geographic location.
The analogy is a local measure of spatial autocorrelation where you estimate the statistic at each location. It is a useful approach for exploratory analysis (e.g., to show where the explanatory variables are most strongly related to the response variable). It is called geographically weighted regression (GWR) or simply geographic regression. GWR is used in epidemiology, particularly for research on infectious diseases and for evaluating health policies and programs.
Since GWR fits a separate regression model using data weighted toward each spatial location in the dataset, it is not a single model but a procedure for fitting a set of models. This is different from spatial regression models such as the spatially lagged Y model, which are single models with spatial terms.
Observations across the entire domain contribute to the model fit at a particular location, but the observations are weighted inversely by their distance to the particular location. At short distances, observations are given the largest weights based on a Gaussian function and a bandwidth. The bandwidth is specified as a single parameter or it is determined through a cross-validation procedure. The bandwidth can also be a function of location.
Said another way, linear regression is a model for the conditional mean. The mean of the response variable depends on the explanatory variables. Geographic regressions show how this dependency varies by location. GWR is used as an exploratory technique for determining where local regression coefficients are different from corresponding global values.
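The mechanics can be sketched in a few lines: at a focal location you fit weighted least squares, with weights given by a Gaussian kernel of the distances to that location. A minimal sketch with simulated data (the variable names, coordinates, and bandwidth are invented for illustration; this is not the {spgwr} implementation):

```r
# Sketch of the idea behind GWR: one local regression at a focal location
set.seed(1)
n <- 49
locs <- cbind(x = runif(n), y = runif(n))   # made-up locations
inc <- runif(n, 5, 30)                      # made-up explanatory variables
hoval <- runif(n, 20, 100)
crime <- 70 - 1.5 * inc - 0.3 * hoval + rnorm(n, sd = 5)

bw <- 0.4                                    # bandwidth (distance units)
d <- sqrt(colSums((t(locs) - locs[1, ])^2))  # distances to focal location 1
w <- exp(-0.5 * (d / bw)^2)                  # Gaussian kernel weights

# Every observation contributes, but nearby ones dominate via the weights
local.fit <- lm(crime ~ inc + hoval, weights = w)
coef(local.fit)
```

Repeating this at every location, recentering the weights each time, yields the set of local coefficient estimates that GWR maps.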
Continuing with the Columbus crime data.
( CC.sf <- sf::st_read(dsn = here::here("data", "columbus"),
layer = "columbus") )
## Reading layer `columbus' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/columbus'
## using driver `ESRI Shapefile'
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## Simple feature collection with 49 features and 20 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 5.874907 ymin: 10.78863 xmax: 11.28742 ymax: 14.74245
## CRS: NA
## First 10 features:
## AREA PERIMETER COLUMBUS_ COLUMBUS_I POLYID NEIG HOVAL INC CRIME
## 1 0.309441 2.440629 2 5 1 5 80.467 19.531 15.725980
## 2 0.259329 2.236939 3 1 2 1 44.567 21.232 18.801754
## 3 0.192468 2.187547 4 6 3 6 26.350 15.956 30.626781
## 4 0.083841 1.427635 5 2 4 2 33.200 4.477 32.387760
## 5 0.488888 2.997133 6 7 5 7 23.225 11.252 50.731510
## 6 0.283079 2.335634 7 8 6 8 28.750 16.029 26.066658
## 7 0.257084 2.554577 8 4 7 4 75.000 8.438 0.178269
## 8 0.204954 2.139524 9 3 8 3 37.125 11.337 38.425858
## 9 0.500755 3.169707 10 18 9 18 52.600 17.586 30.515917
## 10 0.246689 2.087235 11 10 10 10 96.400 13.598 34.000835
## OPEN PLUMB DISCBD X Y NSA NSB EW CP THOUS NEIGNO
## 1 2.850747 0.217155 5.03 38.80 44.07 1 1 1 0 1000 1005
## 2 5.296720 0.320581 4.27 35.62 42.38 1 1 0 0 1000 1001
## 3 4.534649 0.374404 3.89 39.82 41.18 1 1 1 0 1000 1006
## 4 0.394427 1.186944 3.70 36.50 40.52 1 1 0 0 1000 1002
## 5 0.405664 0.624596 2.83 40.01 38.00 1 1 1 0 1000 1007
## 6 0.563075 0.254130 3.78 43.75 39.28 1 1 1 0 1000 1008
## 7 0.000000 2.402402 2.74 33.36 38.41 1 1 0 0 1000 1004
## 8 3.483478 2.739726 2.89 36.71 38.71 1 1 0 0 1000 1003
## 9 0.527488 0.890736 3.17 43.44 35.92 1 1 1 0 1000 1018
## 10 1.548348 0.557724 4.33 47.61 36.42 1 1 1 0 1000 1010
## geometry
## 1 POLYGON ((8.624129 14.23698...
## 2 POLYGON ((8.25279 14.23694,...
## 3 POLYGON ((8.653305 14.00809...
## 4 POLYGON ((8.459499 13.82035...
## 5 POLYGON ((8.685274 13.63952...
## 6 POLYGON ((9.401384 13.5504,...
## 7 POLYGON ((8.037741 13.60752...
## 8 POLYGON ((8.247527 13.58651...
## 9 POLYGON ((9.333297 13.27242...
## 10 POLYGON ((10.08251 13.03377...
Start by fitting a ‘global’ ordinary least squares (OLS) linear regression of crime rates on income and housing values, exactly as you did earlier.
f <- CRIME ~ INC + HOVAL
( model.ols <- lm(formula = f,
data = CC.sf) )
##
## Call:
## lm(formula = f, data = CC.sf)
##
## Coefficients:
## (Intercept) INC HOVAL
## 68.6190 -1.5973 -0.2739
The coefficients on the two explanatory variables indicate that crime decreases in areas of higher income and higher housing values.
You compare this result to results from geographic regressions. The functions are in the {spgwr} package.
if(!require(spgwr)) install.packages(pkgs = "spgwr", repos = "http://cran.us.r-project.org")
## Loading required package: spgwr
## NOTE: This package does not constitute approval of GWR
## as a method of spatial analysis; see example(gwr)
The sp part of the package name indicates that the functions were developed to work with S4 spatial objects.
The functions allow you to use S3 simple features by specifying the locations as a matrix. Here you extract the centroid from each census tract as a matrix.
Locations <- sf::st_coordinates(sf::st_centroid(CC.sf))
## Warning in st_centroid.sf(CC.sf): st_centroid assumes attributes are constant
## over geometries of x
head(Locations)
## X Y
## 1 8.827218 14.36908
## 2 8.332658 14.03162
## 3 9.012265 13.81972
## 4 8.460801 13.71696
## 5 9.007982 13.29637
## 6 9.739926 13.47463
These are the X and Y coordinate values specifying the centroid for the first six tracts (out of 49).
To determine the optimal bandwidth for the Gaussian kernel (weighting function) you use the spgwr::gwr.sel() function. You specify the model formula (formula =), the data frame (data =), and the coordinates (coords =) in the function call. The coords = argument is the matrix of coordinates of points representing the spatial locations of the observations. It can be omitted if the data is an S4 spatial data frame from the {sp} package.
( bw <- spgwr::gwr.sel(formula = f,
data = CC.sf,
coords = Locations) )
## Bandwidth: 2.220031 CV score: 7473.853
## Bandwidth: 3.588499 CV score: 7479.637
## Bandwidth: 1.374271 CV score: 7404.175
## Bandwidth: 0.8515626 CV score: 7389.293
## Bandwidth: 0.7515898 CV score: 7280.867
## Bandwidth: 0.4667245 CV score: 6319.861
## Bandwidth: 0.290668 CV score: 7474.967
## Bandwidth: 0.5755334 CV score: 6754.626
## Bandwidth: 0.3994769 CV score: 6197.735
## Bandwidth: 0.3597549 CV score: 6320.012
## Bandwidth: 0.4132551 CV score: 6200.674
## Bandwidth: 0.4028088 CV score: 6196.867
## Bandwidth: 0.4040147 CV score: 6196.817
## Bandwidth: 0.4038422 CV score: 6196.816
## Bandwidth: 0.4038829 CV score: 6196.816
## Bandwidth: 0.4038015 CV score: 6196.816
## Bandwidth: 0.4038422 CV score: 6196.816
## [1] 0.4038422
The procedure makes an initial guess at the optimal bandwidth distance and then fits a local regression model at each location using weights that decay with distance according to the kernel (Gaussian by default) and that bandwidth.
The output shows that the first bandwidth tried was 2.22 in arbitrary distance units, with a resulting CV score of 7474. The CV score is based on cross validation, whereby prediction skill is computed at each location with the data from that location withheld from the regression fits.
The procedure continues by increasing the bandwidth distance (to 3.59) and computing a new CV score after refitting the regression models. Since the new CV score (7480) is higher than the initial one (7474), the bandwidth is changed in the other direction (decreasing from 2.22 to 1.37) and the models are refit. With that bandwidth the CV score is 7404, which is lower than the initial CV score, so the bandwidth is decreased again. The procedure continues until no additional improvement in prediction skill occurs.
The output shows that no additional improvement in skill occurs at a bandwidth distance of 0.404 units, and this single value is assigned to the object bw.
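The CV score being minimized can be sketched as a leave-one-out procedure: for a trial bandwidth, each observation is predicted from local regressions fit with that observation withheld, and the squared prediction errors are summed. A conceptual sketch (not the {spgwr} code; the function and argument names here are invented):

```r
# Leave-one-out CV score for one trial bandwidth (conceptual sketch)
cv_score <- function(bw, y, X, locs) {
  err2 <- numeric(length(y))
  for (i in seq_along(y)) {
    d <- sqrt(colSums((t(locs) - locs[i, ])^2)) # distances to location i
    w <- exp(-0.5 * (d / bw)^2)                 # Gaussian kernel weights
    w[i] <- 0                                   # withhold observation i
    fit <- lm.wfit(x = cbind(1, X), y = y, w = w)
    err2[i] <- (y[i] - sum(c(1, X[i, ]) * fit$coefficients))^2
  }
  sum(err2) # smaller scores indicate better out-of-sample skill
}
```

spgwr::gwr.sel() searches over bandwidths for the value that minimizes a score of this kind.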
Once the bandwidth distance is determined, you use the spgwr::gwr() function to fit the regressions with that bandwidth. The arguments are the same as before but now include bandwidth =, where you specify the object bw.
model.gwr <- spgwr::gwr(formula = f,
data = CC.sf,
coords = Locations,
bandwidth = bw)
The model output and observed data are assigned to a list object with element names listed using the names() function.
names(model.gwr)
## [1] "SDF" "lhat" "lm" "results" "bandwidth" "adapt"
## [7] "hatmatrix" "gweight" "gTSS" "this.call" "fp.given" "timings"
The first element is SDF, containing the model output as an S4 spatial data frame.
class(model.gwr$SDF)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
See Lesson 7 where S4 spatial data objects were covered.
The structure of the spatial data frame is obtained with the str() function and by setting the max.level argument to 2.
str(model.gwr$SDF, max.level = 2)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 49 obs. of 7 variables:
## ..@ coords.nrs : num(0)
## ..@ coords : num [1:49, 1:2] 8.83 8.33 9.01 8.46 9.01 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..@ bbox : num [1:2, 1:2] 6.22 11.01 10.95 14.37
## .. ..- attr(*, "dimnames")=List of 2
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
Here there are five slots with the first slot labeled @data indicating that it is a data frame. The number of rows and columns in the data frame are listed with the dim() function.
dim(model.gwr$SDF)
## [1] 49 7
There are 49 rows and 7 columns. Each row corresponds to a tract and information about the regressions localized to the tract is given in the columns. Column names are listed with the names() function.
names(model.gwr$SDF)
## [1] "sum.w" "(Intercept)" "INC" "HOVAL" "gwr.e"
## [6] "pred" "localR2"
They include the sum of the weights sum.w (larger sums occur for tracts with many nearby tracts, that is, smaller tracts and ones farther from the borders of the spatial domain), the three regression coefficients (one for each of the explanatory variables INC and HOVAL plus an intercept term), the residual (gwr.e), the predicted value (pred), and the local goodness-of-fit (localR2).
You create a map showing where income has the most and least influence on crime. First add the income coefficient from the data frame (column labeled INC) to the simple feature data frame (the order of the rows in SDF matches the order in the simple feature data frame), then plot using functions from the {ggplot2} package.
CC.sf$INCcoef <- model.gwr$SDF$INC
library(ggplot2)
ggplot(CC.sf) +
geom_sf(aes(fill = INCcoef)) +
scale_fill_viridis_c()
Most tracts have coefficients with values less than zero; recall the global coefficient is less than zero. But the areas in yellow show where the coefficient values are greater than zero, indicating a positive relationship between crime and income.
How about the coefficients on housing values?
CC.sf$HOVALcoef <- model.gwr$SDF$HOVAL
ggplot(CC.sf) +
geom_sf(aes(fill = HOVALcoef)) +
scale_fill_viridis_c()
While the global coefficient is negative, indicating crime rates tend to be lower in areas with higher housing values, the opposite is the case over much of the city, especially on the south side.
You put the vector of GWR predictions into the CC.sf simple feature data frame giving it the column name predGWR and then map the predictions using functions from the {tmap} package.
CC.sf$predGWR <- model.gwr$SDF$pred
tmap::tm_shape(CC.sf) +
tmap::tm_fill("predGWR", title = "Predicted crimes\nper 1000") +
tmap::tm_layout(legend.outside = TRUE)
## Warning: Currect projection of shape CC.sf unknown. Long-lat (WGS84) is assumed.

The geographic regressions capture the spatial pattern of crimes across the city. The spread of predicted values matches the observed spread better than the linear model does. The pattern of predicted crime is also smoother than with a global OLS regression.
Where is the relationship between crime and the two explanatory variables tightest? You answer this by mapping the R-squared value for each of the local models.
CC.sf$localR2 <- model.gwr$SDF$localR2
ggplot(CC.sf) +
geom_sf(aes(fill = localR2)) +
scale_fill_viridis_c()
Although crime rates are highest in the center, the relationship between crime and income and housing values is strongest in tracts across the eastern part of the city.
This type of nuanced exploratory analysis is made possible with GWR.
Also, when fitting a regression model to data that vary spatially, you are assuming an underlying stationary process. This means you believe the explanatory variables ‘provoke’ the same response (statistically) everywhere across the domain. If this is not the case, it shows up as a map of spatially correlated residuals. One way to check the assumption of a stationary process is to use geographic regression.
Mapping incidence and risk with spatial regression models
Spatial regression models are used in disease mapping where it is common to use a standardized incidence ratio (SIR) defined as the ratio of the observed to the expected number of disease cases. Small areas can give extreme SIRs due to low population sizes or small samples. Extreme values of SIRs can be misleading and unreliable for reporting.
Because of this so-called ‘small area problem’ it is better to estimate disease risk using a spatial regression model. Spatial regression models incorporate information from neighboring areas and explanatory information. The result is a smoothing (shrinking) of extreme values.
Consider county-level lung cancer cases in Pennsylvania from the {SpatialEpi} package. The county boundaries for the state are in the list object pennLC with element name spatial.polygon. Change the native spatial polygons S4 object to an S3 simple feature data frame using the sf::st_as_sf() function and display a map of the county borders.
if(!require(SpatialEpi)) install.packages("SpatialEpi", repos = "http://cran.us.r-project.org")
## Loading required package: SpatialEpi
LC.sp <- SpatialEpi::pennLC$spatial.polygon
LC.sf <- sf::st_as_sf(LC.sp)
ggplot(LC.sf) +
geom_sf()
For each region (county) \(i\), \(i = 1, \ldots, n\) the SIR is defined as the ratio of observed counts (\(Y_i\)) to the expected counts (\(E_i\)).
\[ \hbox{SIR}_i = Y_i/E_i. \]
The expected count \(E_i\) is the total number of cases expected if the population of area \(i\) behaves the way the standard population behaves. If you ignore differences in rates across strata (e.g., age groups, race, etc.) then you compute the expected counts as
\[ E_i = r^{(s)} n^{(i)}, \] where \(r^{(s)}\) is the rate in the standard population (total number of cases divided by the total population across all regions), and \(n^{(i)}\) is the population of region \(i\).
Then \(\hbox{SIR}_i\) indicates whether region \(i\) has higher (\(\hbox{SIR}_i > 1\)), equal (\(\hbox{SIR}_i = 1\)) or lower (\(\hbox{SIR}_i < 1\)) risk than expected relative to the standard population.
When applied to mortality data, the ratio is known as the standardized mortality ratio (SMR).
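Ignoring strata, the computation is direct. A toy sketch with made-up counts for three regions:

```r
# Toy SIR computation ignoring strata (all numbers made up)
Y <- c(20, 55, 12)          # observed cases in each region
n <- c(10000, 25000, 8000)  # population of each region
r_s <- sum(Y) / sum(n)      # rate in the standard population
E <- r_s * n                # expected counts
round(Y / E, 2)             # SIR: 0.99 1.09 0.74
```

Here the second region has slightly more cases than expected and the third has noticeably fewer, but with only 12 observed cases that low SIR may simply reflect the small-count problem discussed above.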
The data frame SpatialEpi::pennLC$data contains the number of lung cancer cases and the population of Pennsylvania at county level, stratified on race (white and non-white), gender (female and male) and age (under 40, 40-59, 60-69 and 70+).
You compute the number of cases for all the strata together in each county by aggregating the rows of the data frame by county and adding up the number of cases.
( County.df <- SpatialEpi::pennLC$data |>
dplyr::group_by(county) |>
dplyr::summarize(Y = sum(cases)) )
## # A tibble: 67 × 2
## county Y
## <fct> <int>
## 1 adams 55
## 2 allegheny 1275
## 3 armstrong 49
## 4 beaver 172
## 5 bedford 37
## 6 berks 308
## 7 blair 127
## 8 bradford 59
## 9 bucks 454
## 10 butler 158
## # … with 57 more rows
You then calculate the expected number of cases in each county using indirect standardization. The expected counts in each county represent the total number of disease cases one would expect if the population in the county behaved the way the population of Pennsylvania behaves. You do this by using the SpatialEpi::expected() function. The function has three arguments including population (vector of population counts for each strata in each area), cases (vector with the number of cases for each strata in each area), and n.strata (number of strata).
The vectors population and cases need to be sorted by area first and then, within each area, the counts for all strata need to be listed in the same order. All strata need to be included in the vectors, including strata with 0 cases. Here you use the dplyr::arrange() function.
Strata.df <- SpatialEpi::pennLC$data |>
dplyr::arrange(county, race, gender, age)
head(Strata.df)
## county cases population race gender age
## 1 adams 0 365 o f 40.59
## 2 adams 1 68 o f 60.69
## 3 adams 0 73 o f 70+
## 4 adams 0 1492 o f Under.40
## 5 adams 0 387 o m 40.59
## 6 adams 0 69 o m 60.69
Then you get the expected counts (E) in each county by calling the SpatialEpi::expected() function, setting population equal to Strata.df$population and cases equal to Strata.df$cases. There are two races, two genders, and four age groups for each county, so the number of strata is set to 2 x 2 x 4 = 16.
( E <- SpatialEpi::expected(population = Strata.df$population,
cases = Strata.df$cases,
n.strata = 16) )
## [1] 69.627305 1182.428036 67.610123 172.558055 44.190132 300.705979
## [7] 115.069655 53.237644 428.797481 134.797705 149.846027 5.945905
## [13] 55.475211 79.404013 300.124058 33.906647 73.853240 33.012029
## [19] 53.312111 75.025024 170.866603 199.809038 454.545971 31.543736
## [25] 216.203436 137.810484 5.403583 109.888662 11.594802 33.093112
## [31] 36.659515 71.090130 42.555800 18.735146 204.172754 357.237966
## [37] 91.303056 103.076598 259.874790 309.688036 101.231639 40.105227
## [43] 111.790653 40.774630 100.094714 608.691819 16.081330 222.731099
## [49] 90.872134 31.037254 1219.102696 39.865269 16.003210 147.937712
## [55] 28.878902 74.523497 7.419682 36.174266 35.756382 30.833836
## [61] 51.141014 39.230710 189.097720 44.490161 351.175955 21.009224
## [67] 288.869666
Now you add the observed count Y, the expected count E, and the computed SIR to the simple feature data frame LC.sf and make a map of the standardized incidence ratios (SIR) with blue shades below a value of 1 (the midpoint) and red shades above.
LC.sf <- LC.sf |>
dplyr::mutate(Y = County.df$Y,
E = E,
SIR = Y/E)
ggplot(LC.sf) +
geom_sf(aes(fill = SIR)) +
scale_fill_gradient2(midpoint = 1,
low = "blue",
mid = "white",
high = "red") +
theme_minimal()
In counties with SIR = 1 (white) the number of cancer cases observed is the same as the number of expected cases. In counties with SIR > 1 (red), the number of cancer cases observed is higher than the expected cases. Counties with SIR < 1 (blue) have fewer cancer cases observed than expected.
In regions with few people the expected counts may be very low and the SIR value may be misleading. Therefore, it is preferred to estimate disease risk using models that borrow information from neighboring areas, and incorporate explanatory information. This results in smoothing (shrinkage) of extreme values.
Let the observed counts \(Y\) be modeled with a Poisson distribution having a mean \(E \theta\), where \(E\) are the expected counts and \(\theta\) are the relative risks. The logarithm of the relative risk is expressed as the sum of an intercept that models the overall disease risk level, and random effects to account for local variability.
The relative risk quantifies whether an area has a higher (\(\theta > 1\)) or lower (\(\theta < 1\)) risk than the average risk in the population. For example if \(\theta_i = 2\), then the risk in area \(i\) is twice the average risk in the population.
The model is expressed as
\[ Y \sim \hbox{Poisson}(E\theta) \\ \log(\theta) = \alpha + u + v \]
The parameter \(\alpha\) is the overall risk in the region of study, \(u\) is the spatially structured random effect representing the dependency in risk across neighboring areas, and \(v\) is the uncorrelated random noise modeled as \(v \sim N(0, \sigma_v^2)\).
It is common to include explanatory variables to quantify risk factors (e.g., distance to nearest coal plant). Thus the log(\(\theta\)) is expressed as
\[ \log(\theta) = \alpha + X\beta + u + v \]
where \(X\) are the explanatory variables and \(\beta\) are the associated coefficients. A coefficient is interpreted such that a one-unit increase in the explanatory variable value changes the relative risk by a factor \(\exp(\beta)\), holding the other variables constant.
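For instance, with a hypothetical coefficient of -0.22 on distance to the nearest coal plant (a made-up number for illustration), each one-unit increase in distance multiplies the relative risk by about 0.80, a roughly 20% reduction:

```r
beta <- -0.22  # hypothetical coefficient on a risk factor
exp(beta)      # relative-risk multiplier per one-unit increase, about 0.80
```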
A popular form for the combined spatially structured random effect and the uncorrelated random effect is the Besag-York-Mollié (BYM) model, which assigns a conditional autoregression distribution to \(u\) as
\[ u_i | u_{j \ne i} \sim N(\bar u_{\delta_i}, \frac{\sigma_u^2}{n_{\delta_i}}) \]
where \(\bar u_{\delta_i} = \Sigma_{j \in \delta_i} u_j/n_{\delta_i}\) and where \(\delta_i\) is the set of neighbors of area \(i\) and \(n_{\delta_i}\) is the number of neighbors of area \(i\).
In words, the logarithm of the disease incidence rate in area \(i\) conditional on the incidence rates in the neighborhood of \(i\) is modeled with a normal distribution centered on the neighborhood average (\(\bar u_{\delta_i}\)) with a variance scaled by the number of neighbors. This is called the conditional autoregressive (CAR) distribution.
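The neighborhood average \(\bar u_{\delta_i}\) is easy to compute from a binary adjacency matrix. A small sketch with a made-up four-area adjacency and made-up values of \(u\):

```r
# Conditional means of a CAR random effect (toy adjacency, made-up u)
W <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
u <- c(0.5, -0.2, 0.1, 0.8)
n_delta <- rowSums(W)                  # number of neighbors of each area
u_bar <- as.vector(W %*% u) / n_delta  # neighborhood averages
u_bar
```

The conditional variance \(\sigma_u^2 / n_{\delta_i}\) shrinks as the number of neighbors grows, so areas with many neighbors are pulled more tightly toward their neighborhood average.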
The model is fit using an application of Bayes rule through the method of integrated nested Laplace approximation (INLA), which results in posterior densities for the predicted relative risk. This is done with functions in the {INLA} package. You get the package (it is not on CRAN) as follows.
options(timeout = 120)
install.packages("INLA", repos = c(getOption("repos"), INLA = "https://inla.r-inla-download.org/R/stable"), dep = TRUE)
The syntax for the BYM model using functions from the {INLA} package is given as
f <- Y ~
f(IDu, model = "besag", graph = g, scale.model = TRUE) +
f(IDv, model = "iid")
The formula includes the response on the left-hand side and the fixed and random effects on the right-hand side. By default, the formula includes an intercept.
The random effects are set using f() with parameters equal to the name of the index variable, the model, and other options. The BYM formula includes a spatially structured random effect with an index variable named IDu equal to c(1, 2, …, I), where I is the number of regions (here the number of counties), and model "besag", a CAR distribution with neighborhood structure given by the graph g. The option scale.model = TRUE makes the precision parameters of models with different CAR priors comparable.
The formula also includes an uncorrelated random effect with an index variable named IDv, again equal to c(1, 2, …, I), and model “iid”. This is an independent and identically distributed zero-mean normally distributed random effect. Note that the two ID variables are identical but must be specified as two different objects since INLA does not allow two f() terms to use the same index variable.
The BYM model can also be specified with the model “bym” which defines both the spatially structured random effect and the uncorrelated random effect (\(u\) and \(v\)).
You include these two index vectors (IDu and IDv) in the data frame.
LC.sf <- LC.sf |>
dplyr::mutate(IDu = 1:nrow(LC.sf),
IDv = 1:nrow(LC.sf))
LC.sf
## Simple feature collection with 67 features and 5 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -80.53494 ymin: 39.72316 xmax: -74.72516 ymax: 42.26137
## CRS: +proj=longlat
## First 10 features:
## geometry Y E SIR IDu IDv
## 1 POLYGON ((-77.4467 39.96954... 55 69.62730 0.7899200 1 1
## 2 POLYGON ((-80.14534 40.6742... 1275 1182.42804 1.0782897 2 2
## 3 POLYGON ((-79.21142 40.9091... 49 67.61012 0.7247435 3 3
## 4 POLYGON ((-80.1568 40.85189... 172 172.55806 0.9967660 4 4
## 5 POLYGON ((-78.38063 39.7288... 37 44.19013 0.8372910 5 5
## 6 POLYGON ((-75.53303 40.4508... 308 300.70598 1.0242563 6 6
## 7 POLYGON ((-78.11707 40.7373... 127 115.06965 1.1036793 7 7
## 8 POLYGON ((-76.14609 42.0035... 59 53.23764 1.1082384 8 8
## 9 POLYGON ((-74.97153 40.0554... 454 428.79748 1.0587749 9 9
## 10 POLYGON ((-80.14534 40.6742... 158 134.79770 1.1721268 10 10
Create a graph object from a neighbor list object. Write the neighbor list object to a file then read it back in with the inla.read.graph() function.
nb <- spdep::poly2nb(LC.sf)
spdep::nb2INLA(file = here::here("data", "map.adj"), nb)
g <- INLA::inla.read.graph(filename = here::here("data", "map.adj"))
class(g)
## [1] "inla.graph"
You fit the model by calling the inla() function, specifying the formula, the family (“poisson”), the data, and the expected counts (E). You also set control.predictor = list(compute = TRUE) to compute the posterior predictions.
model.inla <- INLA::inla(formula = f,
family = "poisson",
data = LC.sf,
E = E,
control.predictor = list(compute = TRUE))
The estimates of the relative risk of lung cancer and their uncertainty for each county are given by the posterior mean and the 95% credible intervals contained in the object model.inla$summary.fitted.values. Column mean is the posterior mean, and 0.025quant and 0.975quant are the 2.5 and 97.5 percentiles, respectively.
You add these to the spatial data frame and then make a map of the posterior mean relative risk.
LC.sf$RR <- model.inla$summary.fitted.values[, "mean"]
LC.sf$LL <- model.inla$summary.fitted.values[, "0.025quant"]
LC.sf$UL <- model.inla$summary.fitted.values[, "0.975quant"]
ggplot(LC.sf) +
geom_sf(aes(fill = RR)) +
scale_fill_gradient2(midpoint = 1,
low = "blue",
mid = "white",
high = "red") +
theme_minimal()
These relative risk values are smoother and muted in absolute magnitude compared with the empirical SIR estimates.
More on this topic is available from
- https://www.paulamoraga.com/book-geospatial/index.html
- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166895
The second source is a paper we published addressing long and short term views of tornado risk across the eastern half of the United States.
Tuesday October 18, 2022
“Give someone a program, you frustrate them for a day; teach them how to program, you frustrate them for a lifetime.” - David Leinweber
Today
- Spatial data as point patterns
- Working with point pattern objects using functions from the {spatstat} package
- Quantifying event intensity
Spatial data as point patterns
We now turn our attention to analyzing and modeling point pattern data. We start with some theory, then learn to work with functions from the {spatstat} package, before focusing on spatial intensity.
We naturally seek to find patterns in a collection of events. The pattern that tends to catch our attention most quickly is the grouping of events across space (stars in the night sky seen as constellations, for example). A collection of events in a particular region begs for an explanation. Why do events occur more often in this particular region and not somewhere else?
Consider tornado reports over the past several years in the state of Kansas. Let the start position of a tornado be an event location. And let the damage rating (EF scale) provide a mark on the event. Here you consider only events since 2007 with marks of 1, 2, 3, 4, and 5. More about the damage rating scale is available here https://en.wikipedia.org/wiki/Enhanced_Fujita_scale
Import and filter the data accordingly.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
dplyr::filter(st == "KS",
yr >= 2007,
mag > 0)
## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
Create a map of the event locations using functions from the {tmap} package. The state border is obtained as a simple feature data frame. The polygon geometry is plotted first with tm_borders(), then the event locations are plotted with tm_bubbles() using size = "mag".
KS.sf <- USAboundaries::us_states(states = "Kansas")
tmap::tm_shape(KS.sf) +
tmap::tm_borders(col = "grey70") +
tmap::tm_shape(Torn.sf) +
tmap::tm_bubbles(size = "mag",
col = "red",
alpha = .4,
title.size = "EF Rating") +
tmap::tm_layout(legend.position = c("left", "top"),
legend.outside = TRUE)
Based on this display of tornado genesis locations we ask: (1) Are certain areas of the state more (or less) likely to get a tornado? (2) Do tornadoes tend to cluster? (3) Are there places in the state that are safe from tornadoes?
These questions are similar but they are not identical. We will explore these canonical questions about point pattern data in the next few lessons.
To begin it is helpful to have some definitions.
- Event: An occurrence of interest (e.g., tornado, accident, wildfire).
- Event location: Location of event (e.g., latitude/longitude).
- Point: Any location in the study area where an event could occur. Note: Event location is a particular point where an event did occur. In a forest with a lake, the lake is a place where an event could not occur.
- Point pattern data: A collection of observed (or simulated) event locations together with a domain of interest.
- Domain: Study area that is often defined by data availability (e.g., state or county boundary) or by the extent of the events.
- Complete spatial randomness: Or CSR (not to be confused with CRS, coordinate reference system), the situation where an event has an equal chance of occurring at any point in the domain regardless of other nearby events. In this case we say the event locations have a uniform probability distribution (are uniformly distributed) across space. Note: uniform chance does not mean that the events form an ordered pattern (e.g., trees in an orchard).
Consider a set of event locations that are randomly distributed within the unit plane. First create two vectors containing the x and y coordinates, then create a data frame that includes the name of the sample, and finally graph the locations using ggplot().
library(ggplot2)
x <- runif(n = 50, min = 0, max = 1)
y <- runif(n = 50, min = 0, max = 1)
df1 <- data.frame(x, y, name = "Point Pattern 1")
ggplot(data = df1,
mapping = aes(x, y)) +
geom_point(size = 2)
The plot shows a sample from a spatial point pattern process. A spatial point process is a mechanism for producing a set of event locations across space. The pattern of locations produced by the point process is described as CSR. There are groups of event locations and some gaps.
Let’s repeat this process to create three additional samples. You combine the samples into a single data frame with the rbind() function and then plot a four-panel figure using the facet_wrap() function.
df2 <- data.frame(x = runif(n = 30, min = 0, max = 1),
y = runif(n = 30, min = 0, max = 1),
name = "Point Pattern 2")
df3 <- data.frame(x = runif(n = 30, min = 0, max = 1),
y = runif(n = 30, min = 0, max = 1),
name = "Point Pattern 3")
df4 <- data.frame(x = runif(n = 30, min = 0, max = 1),
y = runif(n = 30, min = 0, max = 1),
name = "Point Pattern 4")
df <- rbind(df1, df2, df3, df4)
ggplot(data = df,
mapping = aes(x, y)) +
geom_point() +
facet_wrap(~ name)
Groups of nearby events illustrate that a certain degree of clustering occurs by chance (without cause) making visual assessment of causal clustering difficult.
Complete spatial randomness sits on a spectrum between regularity and clustering. To illustrate this idea here you generate point pattern data that have more regularity than CSR and point pattern data that are more clustered than CSR. You do this using the rMaternI() and rMatClust() functions from the {spatstat.random} package.
m1 <- spatstat.random::rMaternI(kappa = 100, r = .02)
df1 <- data.frame(x = m1$x, y = m1$y, name = "Regular Pattern 1")
m2 <- spatstat.random::rMaternI(kappa = 100, r = .02)
df2 <- data.frame(x = m2$x, y = m2$y, name = "Regular Pattern 2")
m3 <- spatstat.random::rMatClust(kappa = 30, r = .15, mu = 4)
df3 <- data.frame(x = m3$x, y = m3$y, name = "Cluster Pattern 1")
m4 <- spatstat.random::rMatClust(kappa = 30, r = .15, mu = 4)
df4 <- data.frame(x = m4$x, y = m4$y, name = "Cluster Pattern 2")
df <- rbind(df1, df2, df3, df4)
ggplot(data = df,
mapping = aes(x, y)) +
geom_point() +
facet_wrap(~ name)
The difference in the arrangement of event locations between a regular and a cluster process is clear. But the difference in the arrangement of event locations between a CSR and regular process and the difference in the arrangement of event locations between a CSR and cluster process is not.
And spatial scale matters. A set of event locations can be regular on a small scale but clustered on a larger scale.
Probability models for spatial patterns motivate methods for detecting event clustering. A probability model generates a point pattern process. For example, we can think of crime as a point pattern process defined by location and influenced by environmental factors. The probability of a crime occurring at a particular location is the random variable and we can estimate the probability of a crime event at any location given factors that influence crime.
More formally, a spatial point pattern process is a stochastic (statistical) process where event location is the random variable. A sample of the process is a collection of events generated under the probability model.
A spatial point process is said to be stationary if the statistical properties of the events are invariant to translation. This means that the relationship between two events depends only on the relative event locations (not on where the events occur in the domain). Relative location (or spatial lag) refers to distance and orientation of the events relative to one another.
In the case where the statistical properties are independent of the orientation of event pairs, the process is said to be isotropic.
The properties of stationarity and isotropy allow for replication within a data set. Under the assumption of a stationary process, two event pairs that are separated by the same distance should have the same relatedness. This is analogous to the assumption we make when we define our weights matrix for spatially aggregated data. The assumptions of stationarity and isotropy are starting points for modeling point pattern data.
The Poisson distribution defines a model for complete spatial randomness (CSR). A point process is said to be ‘homogeneous Poisson’ under the following two criteria:
- The number of events, N, occurring within a finite domain A is a random variable described by a Poisson distribution with mean \(\lambda\)|A| for some positive constant \(\lambda\), with |A| denoting the area of the domain, and
- The locations of the N events represent a random sample where each point in A is equally likely to be chosen as an event location.
The first criterion refers to a probability model describing the number of events. It expresses the probability of a given number of events occurring in a fixed region of space when the events occur with a known constant rate.
The Poisson parameter defines the intensity of the point process. Given a set of events, an estimate for the mean (rate) parameter of the Poisson distribution is given by the number of events divided by the domain area.
The second criterion ensures the events are scattered about the domain without clustering or regularity.
The procedure to create a homogeneous Poisson point process follows directly from its definition. Step 1: Sample the total number of events from a Poisson distribution with a mean that is proportional to the domain area. Step 2: Place each event within the domain with coordinates given by a uniform distribution.
For example, let area |A| = 1, and the rate of occurrence \(\lambda\) = 20, then
lambda <- 20
N <- rpois(1, lambda)
x <- runif(N)
y <- runif(N)
df <- data.frame(x, y)
ggplot(data = df,
mapping = aes(x, y)) +
geom_point(size = 2) 
The set of events represents a sample from a homogeneous Poisson point process. The intensity of the events is specified first then the locations are placed uniformly inside the domain. The domain need not be regular. The actual number of events varies from one realization to the next. On average the number of events is 20 but it could vary between 10 and 35 or more.
This point pattern is CSR by construction. However, you are typically in the opposite position. You observe a set of events and you want to know if the events are regular or clustered. The null hypothesis is CSR and you need a test statistic that will summarize the evidence against this hypothesis. The null models are simple so you can use Monte Carlo methods to generate many samples and compare summary statistics from those samples with your observed data.
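The Monte Carlo logic can be sketched in a few lines of base R. Here the summary statistic is the mean nearest-neighbor distance (one of several reasonable choices), and the "observed" events are themselves simulated as a stand-in for real data.

```r
# Monte Carlo comparison against CSR on the unit square.
# Test statistic: mean nearest-neighbor distance.
set.seed(1)
mean_nnd <- function(x, y) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf                 # ignore self-distances
  mean(apply(d, 1, min))
}
x_obs <- runif(50); y_obs <- runif(50)   # stand-in for observed events
t_obs <- mean_nnd(x_obs, y_obs)
t_sim <- replicate(999, {                # reference distribution under CSR
  n <- max(2, rpois(1, 50))              # event count varies by sample
  mean_nnd(runif(n), runif(n))
})
# Small t_obs relative to t_sim suggests clustering; large suggests regularity
p_lower <- mean(t_sim <= t_obs)
```

Because the observed events here are drawn from CSR, p_lower should land well away from 0 and 1; with real data a very small (or very large) value is evidence against CSR.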
In some cases the homogeneous Poisson model is not restrictive enough. This means that you can easily reject the null hypothesis but not learn anything interesting about your data. For example, with health events (locations of people with heart disease) CSR is not an appropriate model because a null hypothesis that incidences are equally likely does not consider that people cluster (locations at risk are not uniform).
Under the constant risk hypothesis, each person has the same risk of heart disease regardless of location, so you expect more cases in areas with more people at risk. Clusters of cases in high population areas violate CSR but not necessarily the constant risk hypothesis. The constant risk hypothesis requires that the intensity of the spatial process be defined as a spatially varying function. That is, you define the intensity as \(\lambda(s)\), where \(s\) denotes location.
The intensity (density) function is a first-order property of the random process. If intensity varies (significantly) across the domain the data-generating process is said to be heterogeneous. The intensity function describes the expected number of events at any location. Events might be independent of one another, but groups of events appear because of the changing intensity.
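A heterogeneous process with intensity \(\lambda(s)\) can be simulated by "thinning" a homogeneous one: generate events at the maximum intensity and keep each with probability \(\lambda(s)/\lambda_{max}\). A base-R sketch with an assumed (made-up) intensity surface:

```r
# Inhomogeneous Poisson process on the unit square via thinning.
set.seed(2)
lambda <- function(x, y) 200 * exp(-3 * x)   # assumed intensity, peaks at x = 0
lambda_max <- 200                            # upper bound of lambda on the domain
N <- rpois(1, lambda_max)                    # homogeneous count at the max rate
x <- runif(N); y <- runif(N)
keep <- runif(N) < lambda(x, y) / lambda_max # retention probability at each event
x <- x[keep]; y <- y[keep]
# The retained events are denser near x = 0: groups appear because the
# intensity varies, not because events attract one another
```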
Working with point pattern objects using functions from the {spatstat} package
The {spatstat} package contains many functions to analyze and model point pattern data. Point pattern data are defined in {spatstat} by an object of class ppp (for planar point pattern) which contains the coordinates of the events (event locations), optional values attached to the events (called ‘marks’), and a description of the domain or ‘window’ over which the events are observed. See ?ppp.object() for details.
Spatial statistics computed on a ppp object will be somewhat sensitive to the choice of the window (domain), so some thought should go into deciding what window should be used.
As an example, the data swedishpines is available in the package as a ppp object.
suppressMessages(library(spatstat))
class(swedishpines)
## [1] "ppp"
swedishpines
## Planar point pattern: 71 points
## window: rectangle = [0, 96] x [0, 100] units (one unit = 0.1 metres)
The data is a planar point pattern object with 71 events.
Note: The events in a ppp object are called ‘points’ rather than events. This is in contrast to the theory above, where a point is any location at which an event could occur, not only an observed event.
All the events are contained within a rectangle window of size 9.6 by 10 meters.
There is a plot() method for ppp objects that provides a quick view of the data and the domain window.
plot(swedishpines)
Events are plotted as open circles inside a box. The plot is labeled with the name of the ppp object.
The function convexhull() from the {spatstat} package creates a convex hull around the events. A convex hull defines the minimum-area convex polygon that contains all the events.
Here you compute the hull and add it to the plot.
plot(swedishpines)
plot(convexhull(swedishpines),
add = TRUE)
The domain (window) for analysis and modeling should be somewhat larger than the convex hull. The function ripras() computes a spatial domain based on the event locations alone assuming the locations are independent and identically distributed.
Here you also overlay this polygon on the plot.
plot(swedishpines)
plot(convexhull(swedishpines),
add = TRUE)
plot(ripras(swedishpines),
add = TRUE, lty = "dotted")
The window can have an arbitrary shape. A rectangle, a polygon, a collection of polygons including holes, or a binary image (mask). A window can be stored as a separate object of class owin. See ?owin.object() for details.
Each event may carry information called a ‘mark’. A mark can be continuous (e.g. tree height) or discrete (tree species).
A multitype point pattern is one in which the events are marked using a factor (e.g., tree species). The mark values are given in a vector of the same length as the vector of locations. That is, marks[i] is the mark attached to the location (x[i], y[i]).
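A small multitype pattern can be built directly, as a sketch (coordinates and species labels are made up; assumes {spatstat.geom} is installed):

```r
# Build a multitype point pattern: a factor mark attached to each event.
if (requireNamespace("spatstat.geom", quietly = TRUE)) {
  library(spatstat.geom)
  x <- c(.2, .5, .8, .3)
  y <- c(.7, .4, .6, .2)
  species <- factor(c("oak", "pine", "oak", "pine"))  # marks[i] goes with (x[i], y[i])
  P <- ppp(x, y, window = owin(c(0, 1), c(0, 1)), marks = species)
  is.multitype(P)   # TRUE because the marks are a factor
}
```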
Consider the ppp object demopat from the {spatstat} package.
plot(demopat)
marks(demopat)
## [1] A B B A B B B A A A B A A B B A A A B B A A A A B B B A A B B B B B A A B
## [38] A A B B A A B B B B A B B B B B B B A A A B A B A B B B B B A B B A A B B
## [75] B B B A B B A A B A B B B A B A B B B B B A A B A B B B B B A A A B A B B
## [112] A
## Levels: A B
Here the domain is defined as an irregular concave polygon with a hole. The distinction between inside and outside is important for spatial statistics computed using the events.
For a multitype pattern (where the marks are factors) you can use the split() function to separate the point pattern objects by mark type. Consider the Lansing Woods data set (lansing) with marks corresponding to tree species.
data(lansing)
LW <- lansing
plot(split(LW))
Quantifying event intensity
The average intensity of events is defined as the number of events per unit area of the domain. The summary() method applied to a ppp object gives the average intensity.
summary(swedishpines)
## Planar point pattern: 71 points
## Average intensity 0.007395833 points per square unit (one unit = 0.1 metres)
##
## Coordinates are integers
## i.e. rounded to the nearest unit (one unit = 0.1 metres)
##
## Window: rectangle = [0, 96] x [0, 100] units
## Window area = 9600 square units
## Unit of length: 0.1 metres
There are 71 events over a window area (spatial domain) of 9600 square units giving an average intensity of 71/9600 = .0074.
The average intensity might not represent the intensity of events locally. We need a way to describe the expected number of events at any location of the region.
Counting the number of events in equal areas is one way. The quadrat method divides the domain into a grid of rectangular cells and the number of events in each cell is counted. Quadrat counting is done with the quadratcount() function.
quadratcount(swedishpines)
## x
## y [0,19.2) [19.2,38.4) [38.4,57.6) [57.6,76.8) [76.8,96]
## [80,100] 3 1 2 3 3
## [60,80) 4 4 3 1 2
## [40,60) 3 5 3 7 3
## [20,40) 1 1 2 3 3
## [0,20) 1 2 2 4 5
By default the function divides the data into a 5 x 5 grid of cells. The event count in each cell is produced. To change the default number of cells in x and y directions you use the nx = and ny = arguments.
quadratcount(swedishpines,
nx = 2,
ny = 3)
## x
## y [0,48) [48,96]
## [66.7,100] 10 11
## [33.3,66.7) 14 14
## [0,33.3) 7 15
The plot method applied to the results of the quadratcount() function adds the counts to a plot. Here you add the counts and include the event locations.
plot(quadratcount(swedishpines))
plot(swedishpines, pch = 19, col = "red",
add = TRUE, main = "")
Grid cell areas will not all be equal when the domain boundary is irregular, as with the demopat ppp object.
plot(quadratcount(demopat))
Areas near the borders are smaller than areas completely within the domain.
When the number of events is large, hexagon grid cells provide a useful alternative to rectangular grid cells.
The process is: (1) tessellate the domain by a regular grid of hexagons, (2) count the number of events in each hexagon, and (3) use a color ramp to display the events per hexagon.
As an example here you generate 20K random values from the standard normal distribution for the x coordinate and the same number of random values for the y coordinate. You then use the hexbin() function from the {hexbin} package and specify 10 bins in the x direction to count the number of events in each hexagon and assign the result to the object hbin.
if(!require(hexbin)) install.packages(pkgs = "hexbin", repos = "http://cran.us.r-project.org")
## Loading required package: hexbin
x <- rnorm(20000)
y <- rnorm(20000)
hbin <- hexbin::hexbin(x, y, xbins = 10)
str(hbin)
## Formal class 'hexbin' [package "hexbin"] with 16 slots
## ..@ cell : int [1:93] 5 15 16 17 24 25 26 27 28 29 ...
## ..@ count : int [1:93] 1 1 4 1 1 5 13 33 23 7 ...
## ..@ xcm : num [1:93] -0.0113 -0.8598 0.1888 0.8585 -2.5488 ...
## ..@ ycm : num [1:93] -4.32 -3.38 -3.69 -3.45 -2.75 ...
## ..@ xbins : num 10
## ..@ shape : num 1
## ..@ xbnds : num [1:2] -3.63 4.52
## ..@ ybnds : num [1:2] -4.32 3.88
## ..@ dimen : num [1:2] 14 11
## ..@ n : int 20000
## ..@ ncells: int 93
## ..@ call : language hexbin::hexbin(x = x, y = y, xbins = 10)
## ..@ xlab : chr "x"
## ..@ ylab : chr "y"
## ..@ cID : NULL
## ..@ cAtt : int(0)
The {hexbin} package uses S4 data classes so the output is stored in slots. Use the plot() method to make a graph.
plot(hbin)
Hexagons have symmetric nearest neighbors (there is only rook contiguity). They have the most sides of any polygon that can tessellate the plane. They are generally more efficient than rectangles at covering the events. In other words it takes fewer of them to cover the same number of events. They are visually less biased for displaying local event intensity compared to squares/rectangles.
Here you generate a large number of random events in the two-dimensional plane. Use a normal distribution in the x-direction and a student t-distribution in the y-direction.
set.seed(131)
x <- rnorm(7777)
y <- rt(7777, df = 3)
hbin2 <- hexbin::hexbin(x, y, xbins = 25)
plot(hbin2)
The {ggplot2} package has the stat_binhex() function so that also can be used for display.
df <- data.frame(x, y)
ggplot(data = df,
mapping = aes(x, y)) +
stat_binhex()
Another way to quantify the spatial intensity is with kernel density estimation (KDE). Let \(s_i\) for \(i\) = 1…\(n\) be event locations, then an estimate for the intensity of the events at any location \(s\) is given by
\[ \hat \lambda (s) = \frac{1}{nh}\sum_{i=1}^nK\Big(\frac{s - s_i}{h}\Big) \] where \(K()\) is the kernel function and \(h > 0\) is a smoothing parameter called the bandwidth. Typically the kernel function is a Gaussian probability density function.
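The estimator can be coded directly from the formula. A base-R sketch with a Gaussian kernel (this mirrors what the density() function below computes, apart from its grid and boundary details):

```r
# Kernel estimate at location s: sum of Gaussian kernels centered on the
# events, scaled by 1/(n*h) as in the formula.
kde1d <- function(s, events, h) {
  sapply(s, function(si) sum(dnorm((si - events) / h)) / (length(events) * h))
}
set.seed(3)
el <- runif(25)                          # event locations on [0, 1]
s_grid <- seq(0, 1, length.out = 200)    # locations at which to estimate
lam_hat <- kde1d(s_grid, el, h = 0.05)
# lam_hat is smooth and nonnegative; a smaller h gives a rougher curve
```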
To help visualize KDE you generate 25 event locations (el) uniformly on the real number line representing a one-dimensional spatial domain between 0 and 1 and then use kernel density estimation to get a continuous intensity function. The density estimation is done using the density() function and here you compare the intensity functions for increasing bandwidths specified with the bw = argument.
el <- runif(25)
dd1 <- density(el, bw = .025)
dd2 <- density(el, bw = .05)
dd3 <- density(el, bw = .1)
df <- data.frame(x = c(dd1$x, dd2$x, dd3$x),
y = c(dd1$y, dd2$y, dd3$y),
bw = c(rep("h = .025", 512),
rep("h = .05", 512),
rep("h = .1", 512)))
df2 <- data.frame(x = el, y = 0)
ggplot(data = df,
mapping = aes(x, y)) +
geom_line() +
facet_wrap(~ bw, nrow = 3) +
geom_point(mapping = aes(x, y),
data = df2,
color = "red")
As the bandwidth increases the curve (black line) representing the local intensity becomes smoother. The intensity is estimated at every location, not just at the location of the event.
The density is a summation of the kernels with one kernel centered on top of each event location. Event locations are marked with a point along the x-axis and the kernel is a Gaussian probability density function. The kernel is placed on each event and the bandwidth specifies the distance between the inflection points of the kernel. The one-dimensional KDE extends to two (or more) dimensions.
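A minimal two-dimensional sketch uses kde2d() from the recommended {MASS} package (the simulated points and bandwidths are arbitrary):

```r
# 2D kernel density estimate on a 50 x 50 grid; h sets the bandwidth
# in the x and y directions.
library(MASS)
set.seed(4)
x <- runif(200)
y <- runif(200)
d2 <- kde2d(x, y, h = c(0.2, 0.2), n = 50, lims = c(0, 1, 0, 1))
# d2$x and d2$y hold the grid coordinates, d2$z the estimated surface
# image(d2$x, d2$y, d2$z)   # uncomment to display
```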
Example: The distribution of trees in a tropical forest
The object bei is a planar point pattern object from the {spatstat} package containing the locations of trees in a tropical rain forest.
summary(bei)
## Planar point pattern: 3604 points
## Average intensity 0.007208 points per square metre
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 metres
##
## Window: rectangle = [0, 1000] x [0, 500] metres
## Window area = 5e+05 square metres
## Unit of length: 1 metre
There are 3604 events (trees) over an area of 500,000 square meters giving an average intensity of .0072 trees per square meter.
The distribution of trees is not uniform (heterogeneous) as can be seen with a plot
plot(bei)
The plot shows clusters of trees and large areas with few if any trees.
Elevation and elevation slope are factors associated with tree occurrence.
The point pattern data is accompanied by data (bei.extra) on elevation (elev) and slope of elevation (grad) across the region.
plot(bei.extra)
These data are stored as im (image) objects.
class(bei.extra$elev)
## [1] "im"
The image object contains a list with 10 elements including the matrix of values (v).
str(bei.extra$elev)
## List of 10
## $ v : num [1:101, 1:201] 121 121 121 121 121 ...
## $ dim : int [1:2] 101 201
## $ xrange: num [1:2] -2.5 1002.5
## $ yrange: num [1:2] -2.5 502.5
## $ xstep : num 5
## $ ystep : num 5
## $ xcol : num [1:201] 0 5 10 15 20 25 30 35 40 45 ...
## $ yrow : num [1:101] 0 5 10 15 20 25 30 35 40 45 ...
## $ type : chr "real"
## $ units :List of 3
## ..$ singular : chr "metre"
## ..$ plural : chr "metres"
## ..$ multiplier: num 1
## ..- attr(*, "class")= chr "unitname"
## - attr(*, "class")= chr "im"
Specifying a spatial domain (window) focuses the analysis on a particular region. Suppose you want to model locations of a certain tree type, but only for trees located at elevations above 145 meters. The levelset() function creates a window from an image object using the thresh = and compare = arguments.
W <- levelset(bei.extra$elev,
thresh = 145,
compare = ">")
class(W)
## [1] "owin"
The result is an object of class owin. The plot method displays the window as a mask, which is the region in black.
plot(W)
You subset the ppp object by the window using the bracket operator ([]). Here you assign the reduced ppp object to beiW and then make a plot.
beiW <- bei[W]
plot(beiW)
Now the analysis window is white and the event locations are plotted on top.
As another example you create a window where altitude is lower than 145 m and slope exceeds .1 degrees. In this case you use the solutionset() function.
V <- solutionset(bei.extra$elev <= 145 &
bei.extra$grad > .1)
beiV <- bei[V]
plot(beiV)
You compute the spatial intensity function over the domain with the density() method using the default Gaussian kernel and fixed bandwidth determined by the window size.
beiV |>
density() |>
plot()
The units of intensity are events per unit area (here square meters). The intensity values are computed on a grid (\(v\)) and are returned as a pixel image.
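The bandwidth need not be left at its default. For a ppp object the density() method takes a sigma = argument (the standard deviation of the Gaussian kernel, in the pattern's length unit); a sketch using the built-in swedishpines pattern:

```r
# Compare a narrow and a wide kernel bandwidth for the same pattern.
if (requireNamespace("spatstat", quietly = TRUE)) {
  suppressMessages(library(spatstat))
  d_narrow <- density(swedishpines, sigma = 5)   # rough intensity surface
  d_wide   <- density(swedishpines, sigma = 20)  # smooth intensity surface
  # plot(d_narrow); plot(d_wide)   # compare side by side
}
```

Data-driven bandwidth selectors such as bw.diggle() are also provided in {spatstat}.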
More than 16,000 of the grid cells have a value of NA as a result of the masking by elevation and slope.
den <- beiV |>
density()
sum(is.na(den$v))
## [1] 16020
Thursday October 20, 2022
“So much complexity in software comes from trying to make one thing do two things.” - Ryan Singer
Today
- Creating ppp and owin objects from simple feature data frames
- Estimating spatial intensity as a function of distance
- Intensity trend as a possible confounding factor
Last time the terminology of point pattern data, including the concept of complete spatial randomness (CSR), was introduced. Focus is typically on naturally occurring systems where the spatial location of events is examined through the lens of statistics in an attempt to understand physical processes.
The {spatstat} package is a comprehensive set of functions for analyzing, plotting, and modeling point pattern data. The package requires the data be of spatial class ppp.
The typical work flow includes importing and munging data as a simple feature data frame and then converting the simple feature data frame to a ppp object for analysis and modeling. But it is sometimes convenient to do some of the data munging after conversion to a ppp object.
Creating ppp and owin objects from simple feature data frames
Consider again Kansas tornadoes. Import the data as a simple feature data frame and transform the geographic CRS to Lambert conic conformal centered on Kansas (EPSG:6922). Keep all tornadoes (having an EF damage rating) since 1950 whose initial location occurred within Kansas.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 6922) |>
dplyr::filter(st == "KS", mag >= 0) |>
dplyr::mutate(EF = factor(mag)) |>
dplyr::select(EF)
## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
Torn.sf |>
head()
## Simple feature collection with 6 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 308290.4 ymin: 172079 xmax: 610402.8 ymax: 427368.7
## Projected CRS: NAD83 / Kansas LCC
## EF geometry
## 1 1 POINT (610402.8 238069.8)
## 2 2 POINT (384977.9 172079)
## 3 3 POINT (525374.2 236300.3)
## 4 0 POINT (395720.4 427368.7)
## 5 0 POINT (523250.6 202968.5)
## 6 0 POINT (308290.4 416796.4)
The length unit is meters. This can be seen by printing the CRS.
sf::st_crs(Torn.sf)
## Coordinate Reference System:
## User input: EPSG:6922
## wkt:
## PROJCRS["NAD83 / Kansas LCC",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["Kansas DOT Lambert (meters)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",36,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-98.25,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",39.5,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",37.5,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",400000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Topographic mapping (small scale)."],
## AREA["United States (USA) - Kansas."],
## BBOX[36.99,-102.06,40.01,-94.58]],
## ID["EPSG",6922]]
Further you note that some tornadoes are incorrectly coded as Kansas tornadoes by plotting the event locations.
plot(Torn.sf$geometry)
You recognize the large number of events within the near-rectangle shape of the Kansas border but you also see a few events clearly outside.
Instead of filtering by column name (st == "KS") you can subset by geometry using the sf::st_intersection() function. Here, since we are using the functions in the {spatstat} package, you do this by defining the state border as an owin object.
You get the Kansas border as a simple feature data frame from the {USAboundaries} package transforming the CRS to that of the tornadoes.
KS.sf <- USAboundaries::us_states(states = "Kansas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf))
You then create an owin object from the simple feature data frame using the as.owin() function.
suppressMessages(library(spatstat))
KS.win <- KS.sf |>
as.owin()
Next you convert the simple feature data frame of tornado reports to a ppp object with the EF damage rating as the marks using the as.ppp() function.
T.ppp <- Torn.sf |>
as.ppp()
plot(T.ppp)
Finally you subset the event locations in the ppp object by the Kansas border using the subset operator ([]).
T.ppp <- T.ppp[KS.win]
plot(T.ppp)
With the T.ppp object you are ready to analyze the tornado locations as spatial point pattern data.
The summary() method summarizes information in the ppp object.
summary(T.ppp)
## Marked planar point pattern: 4281 points
## Average intensity 2.008277e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## 0 2534 0.591917800 1.188735e-08
## 1 1062 0.248072900 4.981990e-09
## 2 453 0.105816400 2.125086e-09
## 3 188 0.043914970 8.819343e-10
## 4 38 0.008876431 1.782633e-10
## 5 6 0.001401542 2.814684e-11
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [62446.7, 723275.4] x [110798.6, 451072.2] units
## (660800 x 340300 units)
## Window area = 2.13168e+11 square units
## Fraction of frame area: 0.948
The output tells you that there are 4281 events (tornado reports) with an average spatial intensity of .0000000201 (2.008277e-08) events per unit area.
The distance unit is meter since that is the length unit in the simple feature data frame (see sf::st_crs(Torn.sf) LENGTHUNIT[“metre”,1]) from which the ppp object was derived. So the area is in square meters making the spatial intensity (number of tornado reports per square meter) quite small.
To make it easier to interpret the intensity convert the length unit from meters to kilometers within the ppp object with the rescale() function from the {spatstat} package (spatstat.geom). The scaling factor argument is s = and the conversion is 1000 m = 1 km so the argument is set to 1000. You then set the unit name to km with the unitname = argument.
T.ppp <- T.ppp |>
spatstat.geom::rescale(s = 1000,
unitname = "km")
summary(T.ppp)
## Marked planar point pattern: 4281 points
## Average intensity 0.02008277 points per square km
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 5 decimal places
##
## Multitype:
## frequency proportion intensity
## 0 2534 0.591917800 1.188735e-02
## 1 1062 0.248072900 4.981990e-03
## 2 453 0.105816400 2.125086e-03
## 3 188 0.043914970 8.819343e-04
## 4 38 0.008876431 1.782633e-04
## 5 6 0.001401542 2.814684e-05
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [62.4467, 723.2754] x [110.7986, 451.0722] km
## (660.8 x 340.3 km)
## Window area = 213168 square km
## Unit of length: 1 km
## Fraction of frame area: 0.948
Caution here as you are recycling the object name T.ppp. If you rerun the above code chunk the scale will change again by a factor of 1000 while the unit name will stay the same.
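One defensive pattern (a sketch, not part of the lesson code) is to rescale only when the pattern's unit name is not already the target, so that rerunning the chunk is harmless. The example below uses the built-in swedishpines pattern, whose unit is 0.1 metre, so 10000 old units make one kilometer.

```r
# Guard against accidentally rescaling twice: check the unit name first.
if (requireNamespace("spatstat.geom", quietly = TRUE)) {
  library(spatstat.geom)
  P <- swedishpines                        # built-in pattern, unit = 0.1 metre
  if (unitname(P)$singular != "km") {      # only rescale when not already km
    P <- rescale(P, s = 10000, unitname = "km")   # 10000 old units = 1 km
  }
}
```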
There are 4281 tornado reports with an average intensity of .02 tornadoes per square km over this time period. Nearly 60% of all Kansas tornadoes are rated EF0. Less than 1% are categorized as ‘violent’ (EF4 or EF5). The area of the state is 213,168 square kilometers.
Plot the events separated by the marks using the plot() method together with the split() function.
T.ppp |>
split() |>
plot()
Relative to the less damaging tornadoes there are far fewer EF4 and EF5 events.
Can the spatial distribution of Kansas tornadoes be described by complete spatial randomness?
The number of tornadoes varies across the state (EF4 tornadoes are rare in the far western part of the state for example) but it’s difficult to say whether this is due to sampling variation. To illustrate this here you compare the EF1 tornado locations with a sample of events generated under the null hypothesis of CSR.
First create Y as an unmarked ppp object containing only the EF1 tornadoes. You do this by keeping only the events with marks equal to one with the subset() function. Since the marks are a factor you remove the levels with the unmark() function.
( Y <- T.ppp |>
subset(marks == 1) |>
unmark() )
## Planar point pattern: 1062 points
## window: polygonal boundary
## enclosing rectangle: [62.4467, 723.2754] x [110.7986, 451.0722] km
There were 1062 reported EF1 tornadoes originating within the state over the period 1950 through 2020.
The average intensity of the EF1 tornado events is obtained with the intensity() function.
intensity(Y)
## [1] 0.00498199
On average there have been .005 EF1 tornadoes per square km, or about one for every 200 square km.
Make a map to check if things look right.
plot(Y)
EF1 tornado reports are found throughout the state and they appear to be distributed randomly.
Formally: Is the spatial distribution of EF1 tornado reports consistent with a set of event locations that are described as complete spatial randomness?
To help answer this question you construct X to be a set of events generated from a homogeneous Poisson process (a model for CSR) where the intensity of the events is equal to the average intensity of the EF1 tornado reports.
You assign the average intensity to an object called lambdaEF1 and then use rpoispp() (random Poisson point pattern) with lambda set to that intensity and the domain specified with the win = argument.
( lambdaEF1 <- intensity(Y) )
## [1] 0.00498199
( X <- rpoispp(lambda = lambdaEF1,
win = Window(Y)) )
## Planar point pattern: 1097 points
## window: polygonal boundary
## enclosing rectangle: [62.4467, 723.2754] x [110.7986, 451.0722] km
The average intensity of X matches (closely) the average intensity of Y by design and the plot() method reveals a similar looking pattern of event locations.
intensity(X)
## [1] 0.00514618
plot(X)
While the pattern is similar, there does appear to be a difference. Can you describe the difference?
To make comparisons between the two point patterns (one of observed events, the other of simulated events) easier you use the superimpose() function to combine them into a single marked ppp object Z, with marks Y and X. Then plot the two intensity rasters split by mark type.
Z <- superimpose(Y = Y,
X = X)## Warning: data contain duplicated points
Z |>
split() |>
density() |>
plot()
The range of local intensity variations is similar. So we don’t have much evidence against the null model of CSR as defined by a homogeneous Poisson process.
Estimating spatial intensity as a function of distance
Are tornado reports more common in the vicinity of towns?
Based on domain specific knowledge of how these data were collected you suspect that tornado reports will cluster near cities and towns. This is especially true in the earlier years of the record.
This understanding is available from the literature on tornadoes (not from the data) and it is a well-known artifact of the data set, but it had never been quantified until 2013 in a paper we wrote. http://myweb.fsu.edu/jelsner/PDF/Research/ElsnerMichaelsScheitlinElsner2013.pdf.
How was this done? You estimate the spatial intensity of the observed tornado reports as a function of distance from the nearest town and compare that estimate with the corresponding estimate using randomly placed events across the state.
First get the city locations from the us_cities() function in the {USAboundaries} package. Exclude towns with fewer than 1000 people and transform the geometry to that of the tornado locations.
C.sf <- USAboundaries::us_cities() |>
dplyr::filter(population >= 1000) |>
sf::st_transform(crs = sf::st_crs(Torn.sf))## City populations for contemporary data come from the 2010 census.
Create a ppp object of events from the city/town locations in the simple feature data frame. Remove the marks and include only events inside the window object (KS.win). Convert the distance unit from meters to kilometers.
C.ppp <- C.sf |>
as.ppp() |>
unmark()## Warning in as.ppp.sf(C.sf): only first attribute column is used for marks
C.ppp <- C.ppp[KS.win] |>
spatstat.geom::rescale(s = 1000,
unitname = "km")
plot(C.ppp)
Next compute a ‘distance map’. A distance map for a spatial domain A is a function \(f(s)\) whose value is defined for any point \(s\) as the shortest distance from \(s\) to any event location in A.
This is done with the distmap() function and the points are the intersections of a 128 x 128 rectangular grid.
Zc <- distmap(C.ppp)
plot(Zc)
The result is an object of class im (image raster). Distances are in kilometers. Most points in Kansas are less than 50 km from the nearest town (reds and blues) but some points are more than 80 km away (yellow).
Other distance functions include pairdist(), which computes the pairwise distances between all event pairs, and crossdist(), which computes the distances between events from two point patterns. The nndist() function computes the distance from each event to its nearest neighboring event.
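A minimal sketch of how these three functions differ, using small simulated patterns (P and Q here are illustrative stand-ins, not the tornado data):

```r
library(spatstat)

set.seed(42)
P <- rpoispp(lambda = 50)  # simulated CSR pattern in the unit square
Q <- rpoispp(lambda = 20)  # a second simulated pattern

d1 <- pairdist(P)     # square matrix of distances between every pair of events in P
d2 <- crossdist(P, Q) # matrix of distances: rows are events of P, columns events of Q
d3 <- nndist(P)       # vector: each event's distance to its nearest neighbor in P
```

Note that the diagonal of pairdist(P) is zero (each event's distance to itself), while nndist(P) reports the smallest off-diagonal distance for each event.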
The distance map (distance from any point in Kansas to the nearest town) is used to quantify the population bias in the tornado records.
This is done with rhohat() which estimates the smoothed spatial intensity as a function of some explanatory variable. The relationship between spatial intensity and an explanatory variable is sometimes called a ‘resource selection’ function (if the events are organisms and the variable is a descriptor of habitat) or a ‘prospectivity index’ (if the events are mineral deposits and the variable is a geological variable).
The method assumes the events are a realization from a Poisson process with intensity function \(\lambda(u)\) of the form
\[ \lambda(u) = \rho[Z(u)] \]
where \(Z\) is the spatial explanatory variable (covariate) function (with continuous values) and \(\rho(z)\) is a function to be estimated.
The function does not assume a particular form for the relationship between the point pattern and the variable (thus it is said to be ‘non-parametric’).
Here you use rhohat() to estimate tornado report intensity as a function of distance to nearest city.
The first argument in rhohat() is the ppp object for which you want the intensity estimate and the covariate = argument is the spatial variable, here an object of class im. By default kernel smoothing is done with a fixed bandwidth; with method = "transform" a variable bandwidth is used.
rhat <- rhohat(Y,
covariate = Zc,
method = "transform")
class(rhat)## [1] "rhohat" "fv" "data.frame"
The resulting object (rhat) has three classes including a data frame. The data frame contains the explanatory variable as a single vector (Zc), an estimate of the intensity at the distances (rho), the variance (var) and upper (hi) and lower (lo) uncertainty values (point-wise).
rhat |>
data.frame() |>
head()## Zc rho ave var hi lo
## 1 0.06485354 0.008502416 0.00498199 3.757291e-07 0.009703810 0.007301021
## 2 0.25954706 0.008488610 0.00498199 3.734674e-07 0.009686383 0.007290838
## 3 0.45424058 0.008473656 0.00498199 3.710128e-07 0.009667487 0.007279826
## 4 0.64893409 0.008456981 0.00498199 3.682946e-07 0.009646430 0.007267532
## 5 0.84362761 0.008438888 0.00498199 3.653497e-07 0.009623571 0.007254204
## 6 1.03832112 0.008419443 0.00498199 3.621863e-07 0.009598987 0.007239899
Here you put these values into a new data frame (df), multiplying the intensities by 10,000 (so the areal unit is a 100 km by 100 km area), then use ggplot() with a geom_ribbon() layer to overlay the uncertainty band.
df <- data.frame(dist = rhat$Zc,
rho = rhat$rho * 10000,
hi = rhat$hi * 10000,
lo = rhat$lo * 10000)
library(ggplot2)
ggplot(data = df) +
geom_ribbon(mapping = aes(x = dist, ymin = lo , ymax = hi), alpha = .3) +
geom_line(mapping = aes(x = dist, y = rho), color = "red") +
geom_hline(yintercept = intensity(Y) * 10000, color = "blue") +
scale_y_continuous(limits = c(0, 100)) +
ylab("Tornado reports (EF1) per 100 sq. km") +
xlab("Distance from nearest town center (km)") +
theme_minimal()
The vertical axis on the plot is the tornado report intensity in units of number of reports per 100 square kilometers. The horizontal axis is the distance to nearest town in km. The red line is the average spatial intensity as a function of distance from nearest town. The 95% uncertainty band about this estimate is shown in gray.
At points close to the town center tornado reports are high relative to at points far from town. The blue line is the average intensity across the state computed with the intensity() function and scaled appropriately. At points within about 15 km the tornado report intensity is above the statewide average intensity. At points greater than about 60 km the report intensity is below the statewide average.
At zero distance from a town, this number is more than 1.7 times higher (85 tornadoes per 100 sq. km). The spatial scale is about 15 km (distance along the spatial axis where the red line falls below the blue line).
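That spatial scale can be estimated directly from the rhohat() output rather than read off the plot. A minimal sketch, assuming rhat and Y exist as above (d.cross is a name introduced here for illustration):

```r
# first distance at which the estimated intensity drops below the statewide average
d.cross <- min(rhat$Zc[rhat$rho < intensity(Y)])
d.cross  # should be roughly 15 km for the Kansas EF1 reports
```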
At this point in the analysis you should pause to consider that, although the plot looks reasonable given your expectation of a population bias in the tornado reports (more reports near cities/towns), the result could be an artifact of the smoothing algorithm.
You need to know how to apply statistical tools to accomplish specific tasks. But you also need to be a bit skeptical of the tool’s outcome. The skepticism provides a critical check against being fooled by randomness.
As an example, the method of computing the spatial intensity as a function of a covariate should give you a different answer if events are randomly distributed. If the events are randomly distributed, what would you expect to find on a plot such as this?
You already generated a set of events from a homogeneous Poisson model so you can check simply by applying the rhohat() function to these events using the same set of city/town locations.
rhat0 <- rhohat(X,
covariate = Zc,
method = "transform")
df <- data.frame(dist = rhat0$Zc,
rho = rhat0$rho * 10000,
hi = rhat0$hi * 10000,
lo = rhat0$lo * 10000)
ggplot(df) +
geom_ribbon(aes(x = dist, ymin = lo , ymax = hi), alpha = .3) +
geom_line(aes(x = dist, y = rho), color = "red") +
geom_hline(yintercept = intensity(Y) * 10000, color = "blue") +
scale_y_continuous(limits = c(0, 100)) +
ylab("Random events per 100 sq. km") +
xlab("Distance from nearest town center (km)") +
theme_minimal()
As expected, the number of random events near cities/towns is not higher than the number of random events at greater distances. The difference between the two point pattern data sets can be explained by the clustering of actual tornado reports in the vicinity of towns.
Intensity trend as a possible confounding factor
Quantifying the report bias with the spatial intensity function works well for Kansas where there is no trend in the local intensity. Local tornado intensity is largely uniform across Kansas.
Things are different in Texas where a significant intensity trend makes it more difficult to estimate the report bias.
Convert the tornado reports (EF0 or worse) occurring over Texas to a ppp object. Use a Texas-centric Lambert conic conformal projection.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 3082) |>
dplyr::filter(mag >= 0)## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
T.ppp <- Torn.sf |>
as.ppp()## Warning in as.ppp.sf(Torn.sf): only first attribute column is used for marks
W <- USAboundaries::us_states(states = "Texas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf)) |>
as.owin()
( T.ppp <- T.ppp[W] |>
spatstat.geom::rescale(s = 1000,
unitname = "km") )## Marked planar point pattern: 8932 points
## marks are numeric, of storage type 'double'
## window: polygonal boundary
## enclosing rectangle: [873.7638, 2116.6498] x [5881.245, 7063.086] km
intensity(T.ppp)## [1] 0.01293119
There are 8,932 tornado reports. The distance unit is kilometer. The average intensity is .013 events per square kilometer over this 71-year period (1950-2020).
Next plot the local intensity using a kernel smoother.
T.ppp |>
density() |>
plot()
There is a clear trend in tornado reports, from few reports in the southwest part of the state along the Rio Grande to many reports in the northeast part of the state. The statewide average intensity of .013 tornado reports per square km is too high for the southwest and too low for the northeast.
Next compute and plot the spatial intensity as a smoothed function of distance to the nearest town or city. Start by removing the marks on the tornado events, assigning the unmarked ppp object to Tum.ppp. Then create a ppp object from the city/town locations and subset it by the window.
Tum.ppp <- T.ppp |>
unmark()
C.ppp <- C.sf |>
sf::st_transform(crs = sf::st_crs(Torn.sf)) |>
as.ppp() |>
unmark()## Warning in as.ppp.sf(sf::st_transform(C.sf, crs = sf::st_crs(Torn.sf))): only
## first attribute column is used for marks
C.ppp <- C.ppp[W] |>
spatstat.geom::rescale(s = 1000,
unitname = "km")
Next create a distance map of the city/town locations using the distmap() function.
Zc <- distmap(C.ppp)
plot(Zc)
Finally, compute the intensity of tornadoes as a smoothed function of distance to nearest town/city with the rhohat() function. Prepare the output and make a plot.
rhat <- rhohat(Tum.ppp,
covariate = Zc,
method = "transform")
data.frame(dist = rhat$Zc,
rho = rhat$rho,
hi = rhat$hi,
lo = rhat$lo) |>
ggplot() +
geom_ribbon(aes(x = dist, ymin = lo , ymax = hi), alpha = .3) +
geom_line(aes(x = dist, y = rho), color = "red") +
scale_y_continuous(limits = c(0, NA)) +
geom_hline(yintercept = intensity(Tum.ppp), color = "blue") +
ylab("Tornado reports per sq. km") +
xlab("Distance from nearest town center (km)") +
theme_minimal()
The plot shows that the intensity of the tornado reports is much higher than the average intensity in the vicinity of towns and cities. Yet caution needs to be exercised in the interpretation because the trend of increasing tornado reports moving from southwest to northeast across the state is mirrored by the trend in the occurrence of cities/towns. There are many fewer towns in the southwestern part of Texas compared to the northern and eastern parts of the state.
You can quantify this effect by specifying a function in the covariate = argument. Here you specify a planar surface with x,y as arguments and x + y inside the function. Here you use the plot() method on the output (instead of creating a data frame and using ggplot()).
plot(rhohat(Tum.ppp,
covariate = function(x,y){x + y},
method = "transform"),
main = "Spatial intensity trend of tornadoes")
Local intensity increases along the axis labeled X starting at a value of 7,400. At a value of X of about 8,200 the spatial intensity stops increasing.
Units along the horizontal axis are kilometers but the reference (intercept) distance is at the far left. So you interpret the increase in spatial intensity going from southwest to northeast as a change across about 800 km (8200 - 7400).
The local intensity of cities has the same property (increasing from southwest to northeast then leveling off). Here you substitute C.ppp for Tum.ppp in the rhohat() function.
plot(rhohat(C.ppp,
covariate = function(x,y){x + y},
method = "transform"),
main = "Spatial intensity trend of cities")
So the population bias towards more reports near towns/cities is potentially confounded by the fact that there tends to be more cities and towns in areas that have conditions more favorable for tornadoes.
Thus you can only get so far by examining intensity estimates. If your interest lies in inferring the causes of spatial variation in the intensity you will need to look at second order (clustering) properties of the events.
Tuesday October 25, 2022
“To me programming is more than an important practical art. It is also a gigantic undertaking in the foundations of knowledge.” – Grace Hopper
Today
- Estimating the relative risk of events
- Estimating second-order properties of spatial events
Estimating the relative risk of events
Separate spatial intensity maps across two mark types provide a way to estimate the risk of one event type conditional on the other event type. More generally, the relative risk of occurrence of some event is a conditional probability. A non-spatial example is the risk of catching a disease if you are elderly relative to the risk if you are young.
Given a tornado somewhere in Texas what is the chance that it will cause at least EF3 damage? With the historical set of all tornadoes marked by the damage rating you can make a map of all tornadoes and a map of the EF3+ tornadoes and then take the ratio.
To see this start by importing the tornado data, mutating and selecting the damage rating as a factor called EF before turning the resulting simple feature data frame into a planar point pattern.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 3082) |>
dplyr::filter(mag >= 0) |>
dplyr::mutate(EF = as.factor(mag)) |>
dplyr::select(EF)## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
library(spatstat)
T.ppp <- Torn.sf |>
as.ppp()
Then subset by the boundary of Texas.
TX.sf <- USAboundaries::us_states(states = "Texas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf))
W <- TX.sf |>
as.owin()
T.ppp <- T.ppp[W]
summary(T.ppp)## Marked planar point pattern: 8932 points
## Average intensity 1.293119e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Multitype:
## frequency proportion intensity
## 0 4773 0.5343708000 6.910052e-09
## 1 2557 0.2862741000 3.701865e-09
## 2 1225 0.1371473000 1.773479e-09
## 3 323 0.0361621100 4.676192e-10
## 4 48 0.0053739360 6.949141e-11
## 5 6 0.0006717421 8.686426e-12
##
## Window: polygonal boundary
## single connected closed polygon with 550 vertices
## enclosing rectangle: [873763.8, 2116649.8] x [5881245, 7063086] units
## (1243000 x 1182000 units)
## Window area = 6.90733e+11 square units
## Fraction of frame area: 0.47
The chance that a tornado anywhere in Texas will be EF3 or worse is the sum of the proportions for these types: .03616 + .00537 + .00067 = .042 (or 4.2%).
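This arithmetic can be done programmatically from the marks table. A minimal sketch, assuming T.ppp exists as created above (props is a name introduced here for illustration):

```r
# proportion of Texas tornadoes rated EF3, EF4, or EF5
props <- table(marks(T.ppp)) / npoints(T.ppp)
sum(props[c("3", "4", "5")])  # roughly .042
```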
As found previously there is a spatial intensity gradient across the state with fewer tornadoes in the southwest and more in the northeast. Also the more damaging tornadoes might be more common relative to all tornadoes in some parts of the state compared with other parts.
To create a map of the relative risk of the more damaging tornadoes you start by making two ppp objects, one containing the events with damage ratings 0, 1, or 2 and the other the events with damage ratings 3, 4, or 5. You do this by subsetting the object using brackets ([]) and the logical operator | (or), then merging the two subsets with the superimpose() function, assigning the names H and I as marks.
H.ppp <- unmark(T.ppp[T.ppp$marks == 2 | T.ppp$marks == 1 | T.ppp$marks == 0])
I.ppp <- unmark(T.ppp[T.ppp$marks == 3 | T.ppp$marks == 4 | T.ppp$marks == 5])
T2.ppp <- superimpose(H = H.ppp,
I = I.ppp)## Warning: data contain duplicated points
See https://en.wikipedia.org/wiki/Enhanced_Fujita_scale for definitions of EF tornado rating.
The chance that a tornado chosen at random is intense (EF3+) is 4.2%. Plot the event locations for the set of intense tornadoes.
plot(I.ppp,
pch = 25,
cols = "red",
main = "")
plot(T.ppp, add = TRUE, lwd = .1)
To get the relative risk use the relrisk() function. If X is a multi-type point pattern with factor marks and two levels of the factor then the events of the first type (the first level of marks(X)) are treated as controls (conditionals) or non-events, and events of the second type are treated as cases.
The relrisk() function estimates the local chance of a case (i.e. the probability \(p(u)\) that a point at \(u\) will be a case) using a kernel density smoother. The bandwidth for the kernel is specified or can be found through an iterative cross-validation procedure (recall the bandwidth selection procedure used in geographic regression) using the bw.relrisk() function.
The bandwidth has units of length (here meters). You specify a minimum and maximum bandwidth with the hmin = and hmax = arguments. This takes a few seconds.
( bw <- bw.relrisk(T2.ppp,
hmin = 1000,
hmax = 200000) )## sigma
## 119770.4
The optimal bandwidth (sigma) is 119770 meters or about 120 km.
Now estimate the relative risk at points defined by a 256 by 256 grid and using the 120 km bandwidth for the kernel smoother.
rr <- relrisk(T2.ppp,
sigma = bw,
dimyx = c(256, 256))
The result is an object of class im (image) with values you interpret as the conditional probability of an ‘intense’ tornado.
You retrieve the range of probabilities with the range() function. Note that many of the values are NA, corresponding to pixels outside the window, so you set the na.rm argument to TRUE.
range(rr, na.rm = TRUE)## [1] 0.005003694 0.060170214
The probabilities range from a low of .5% to a high of 6%. This range compares with the statewide average probability of 4.2%.
Map the probabilities with the plot() method.
plot(rr)
Make a better map by converting the image to a raster, setting the CRS, and then using functions from the {tmap} package.
tr.r <- raster::raster(rr)
raster::crs(tr.r) <- sf::st_crs(Torn.sf)$proj4string
tmap::tm_shape(tr.r) +
tmap::tm_raster()
The chance that a tornado is more damaging peaks in the northeast part of the state.
Since the relative risk is computed for any point it is of interest to extract the probabilities for cities and towns.
You get city locations with the us_cities() function from the {USAboundaries} package, which returns a simple feature data frame of cities. The CRS is 4326 and you filter to keep only cities with a 2010 population of more than 100,000.
Cities.sf <- USAboundaries::us_cities(state = "TX") |>
sf::st_transform(crs = raster::crs(tr.r)) |>
dplyr::filter(population > 100000)## City populations for contemporary data come from the 2010 census.
Use the extract() function from the {raster} package to get a single value for each city. Put these values into the simple feature data frame.
Cities.sf$tr <- raster::extract(tr.r,
Cities.sf)
Cities.sf |>
dplyr::arrange(desc(tr)) ## Simple feature collection with 29 features and 13 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 893306.2 ymin: 5900924 xmax: 2063028 ymax: 6916010
## CRS: PROJCRS["unknown",
## BASEGEOGCRS["unknown",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6269]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]]],
## CONVERSION["unknown",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",18,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-100,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",27.5,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",35,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",1500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",5000000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
## # A tibble: 29 × 14
## city state_name state_abbr county county_name stplfips_2010 name_2010
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Mckinney Texas TX COLLIN Collin 4845744 McKinney…
## 2 Mesquite Texas TX Dallas Dallas 4847892 Mesquite…
## 3 Garland Texas TX Dallas Dallas 4829000 Garland …
## 4 Plano Texas TX Collin Collin 4858016 Plano ci…
## 5 Frisco Texas TX Collin Collin 4827684 Frisco c…
## 6 Dallas Texas TX DALLAS Dallas 4819000 Dallas c…
## 7 Carrollton Texas TX Dallas Dallas 4813024 Carrollt…
## 8 Irving Texas TX Dallas Dallas 4837000 Irving c…
## 9 Denton Texas TX DENTON Denton 4819972 Denton c…
## 10 Grand Prair… Texas TX Dallas Dallas 4830464 Grand Pr…
## # … with 19 more rows, and 7 more variables: city_source <chr>,
## # population_source <chr>, place_type <chr>, year <int>, population <int>,
## # geometry <POINT [m]>, tr <dbl>
To illustrate the results create a graph using the geom_lollipop() function from the {ggalt} package. Use the package {scales} to allow for labels in percent.
library(ggalt)## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
library(scales)##
## Attaching package: 'scales'
## The following object is masked from 'package:spatstat.geom':
##
## rescale
ggplot(Cities.sf, aes(x = reorder(city, tr), y = tr)) +
geom_lollipop(point.colour = "steelblue", point.size = 3) +
scale_y_continuous(labels = percent, limits = c(0, .0625)) +
coord_flip() +
labs(x = "", y = NULL,
title = "Historical chance that a tornado caused at least EF3 damage",
subtitle = "Cities in Texas with a 2010 population > 100,000",
caption = "Data from SPC (1950-2020)") +
theme_minimal()
Another example: Florida wildfires
Given a wildfire in Florida what is the probability that it was started by lightning?
Import wildfire data (available here: https://www.fs.usda.gov/rds/archive/catalog/RDS-2013-0009.4) as a simple feature data frame and transform the native CRS to a Florida GDL Albers (EPSG 3086).
if(!"FL_Fires" %in% list.files(here::here("data"))){
download.file("http://myweb.fsu.edu/jelsner/temp/data/FL_Fires.zip",
destfile = here::here("data", "FL_Fires.zip"))
unzip(zipfile = here::here("data", "FL_Fires.zip"),
exdir = here::here("data"))
}
FL_Fires.sf <- sf::st_read(dsn = here::here("data", "FL_Fires")) |>
sf::st_transform(crs = 3086)## Reading layer `FL_Fires' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/FL_Fires'
## using driver `ESRI Shapefile'
## Simple feature collection with 90261 features and 37 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -9750382 ymin: 2824449 xmax: -8908899 ymax: 3632749
## Projected CRS: Mercator_2SP
dim(FL_Fires.sf)## [1] 90261 38
Each row is a unique fire and the data spans the period 1992-2015. There are over 90K rows and 38 variables.
To make things run faster, here you analyze only a random sample of all the data. You do this with the dplyr::sample_n() function where the argument size = specifies the number of rows to choose at random. Save the sample of events to the object FL_FiresS.sf. First set the seed for the random number generator so that the set of rows chosen will be the same every time you run the code.
set.seed(78732)
FL_FiresS.sf <- FL_Fires.sf |>
dplyr::sample_n(size = 2000)
dim(FL_FiresS.sf)## [1] 2000 38
The result is a simple feature data frame with exactly 2000 rows.
The character variable STAT_CAU_1 indicates the cause of the wildfire.
FL_FiresS.sf$STAT_CAU_1 |>
table()##
## Arson Campfire Children Debris Burning
## 147 32 93 239
## Equipment Use Fireworks Lightning Miscellaneous
## 308 4 495 199
## Missing/Undefined Powerline Railroad Smoking
## 74 9 365 31
## Structure
## 4
There are 13 causes (listed in alphabetical order) with various occurrence frequencies. Lightning is the most common.
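A quick check of that frequency as a proportion, assuming FL_FiresS.sf exists as sampled above:

```r
# fraction of the sampled wildfires attributed to lightning
mean(FL_FiresS.sf$STAT_CAU_1 == "Lightning")  # 495/2000 = .2475
```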
To analyze these data as spatial events, you first convert the simple feature data to a ppp object over a window defined by the state boundaries. Use the cause of the fire as a factor mark.
F.ppp <- FL_FiresS.sf["STAT_CAU_1"] |>
as.ppp()
W <- USAboundaries::us_states(states = "Florida") |>
sf::st_transform(crs = sf::st_crs(FL_Fires.sf)) |>
as.owin()
F.ppp <- F.ppp[W]
marks(F.ppp) <- as.factor(marks(F.ppp)) # make the character marks factor marks
summary(F.ppp)## Marked planar point pattern: 2000 points
## Average intensity 1.297232e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## Arson 147 0.0735 9.534653e-10
## Campfire 32 0.0160 2.075571e-10
## Children 93 0.0465 6.032128e-10
## Debris Burning 239 0.1195 1.550192e-09
## Equipment Use 308 0.1540 1.997737e-09
## Fireworks 4 0.0020 2.594463e-11
## Lightning 495 0.2475 3.210649e-09
## Miscellaneous 199 0.0995 1.290746e-09
## Missing/Undefined 74 0.0370 4.799757e-10
## Powerline 9 0.0045 5.837543e-11
## Railroad 365 0.1825 2.367448e-09
## Smoking 31 0.0155 2.010709e-10
## Structure 4 0.0020 2.594463e-11
##
## Window: polygonal boundary
## 4 separate polygons (no holes)
## vertices area relative.area
## polygon 1 356 1.53185e+11 0.994000
## polygon 2 15 8.05114e+08 0.005220
## polygon 3 5 7.46249e+07 0.000484
## polygon 4 5 1.09937e+08 0.000713
## enclosing rectangle: [52649.1, 794026.5] x [56850.4, 781579.4] units
## (741400 x 724700 units)
## Window area = 1.54174e+11 square units
## Fraction of frame area: 0.287
Output from the summary() method displays a table of frequency by type including the proportion and the average spatial intensity (per square meters).
The probability that a wildfire is caused by lightning is about 25% (proportion column of the frequency versus type table). How does this probability vary over the state?
Note that the window contains four separate polygons to capture the main boundary (polygon 1, which has over 99% of the area) and the Florida Keys.
plot(W)
First split the object F.ppp on whether or not the cause was lightning and then merge the two event types and assign names NL (human caused) and L (lightning caused) as marks.
L.ppp <- F.ppp[F.ppp$marks == "Lightning"] |>
unmark()
NL.ppp <- F.ppp[F.ppp$marks != "Lightning"] |>
unmark()
LNL.ppp <- superimpose(NL = NL.ppp,
L = L.ppp)## Warning: data contain duplicated points
summary(LNL.ppp)## Marked planar point pattern: 2000 points
## Average intensity 1.297232e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Multitype:
## frequency proportion intensity
## NL 1505 0.7525 9.761669e-09
## L 495 0.2475 3.210649e-09
##
## Window: polygonal boundary
## 4 separate polygons (no holes)
## vertices area relative.area
## polygon 1 356 1.53185e+11 0.994000
## polygon 2 15 8.05114e+08 0.005220
## polygon 3 5 7.46249e+07 0.000484
## polygon 4 5 1.09937e+08 0.000713
## enclosing rectangle: [52649.1, 794026.5] x [56850.4, 781579.4] units
## (741400 x 724700 units)
## Window area = 1.54174e+11 square units
## Fraction of frame area: 0.287
Now the two types are NL and L composing 75% and 25% of all wildfire events.
The function relrisk() computes the spatially-varying probability of a case (event type), (i.e. the probability \(p(u)\) that a point at location \(u\) will be a case).
Here you compute the relative risk on a 256 by 256 grid.
wfr <- relrisk(LNL.ppp,
dimyx = c(256, 256))
Create a map from the raster by first converting the image object to a raster object and assigning the CRS with the crs() function from the {raster} package. Add the county borders for geographic reference.
wfr.r <- raster::raster(wfr)
raster::crs(wfr.r) <- sf::st_crs(FL_Fires.sf)$proj4string
FL.sf <- USAboundaries::us_counties(state = "FL") |>
sf::st_transform(crs = sf::st_crs(FL_Fires.sf))
tmap::tm_shape(wfr.r) +
tmap::tm_raster(title = "Probability") +
tmap::tm_shape(FL.sf) +
tmap::tm_borders(col = "gray70") +
tmap::tm_legend(position = c("left", "center") ) +
tmap::tm_layout(main.title = "Chance a wildfire was started by lightning (1992-2015)",
main.title.size = 1) +
tmap::tm_compass(position = c("right", "top")) +
tmap::tm_credits(text = "Data source: Karen Short https://doi.org/10.2737/RDS-2013-0009.4",
position = c("left", "bottom")) 
Estimating second-moment properties of spatial events
Spatial intensity is a first-moment property of event locations (like the average of a set of numbers). It answers the question: where are events more and less frequent?
Clustering is a second-moment property of event locations (like the variance of a set of numbers). It answers the question: is the probability of an event in the proximity of another event higher than expected by chance?
One example of clustering occurs with the location of trees in a forest. A tree’s seed dispersal mechanism leads to a greater likelihood of another tree growing nearby.
Let \(r\) be the distance between two event locations or the distance between an event and an arbitrary point within the domain, then functions to describe clustering include:
The nearest neighbor distance function \(G(r)\): The cumulative distribution of the distances from an event to the nearest other event (event-to-event function). It summarizes the distance between events (amount of clustering).
The empty space function \(F(r)\): The cumulative distribution of the distances from a point in the domain to the nearest event (point-to-event function). It summarizes the distance gaps between events (amount of gappiness or lacunarity).
The reduced second-moment function (Ripley \(K\)) \(K(r)\): Defined such that \(\lambda \times K(r)\) is the expected number of additional events within a distance \(r\) of an event, where \(\lambda\) is the average intensity of the events. It is a measure of the spatial autocorrelation among the events.
To assess the degree of clustering and significance (in a statistical sense), we estimate values of the function using our data set and compare the resulting curve (empirical curve) to a theoretical curve assuming a homogeneous Poisson process.
The theoretical curve is well defined for homogeneous point patterns (recall: CSR–complete spatial randomness). Deviations of an ‘empirical’ curve from the theoretical curve provide evidence against CSR.
The theoretical functions assuming a homogeneous Poisson process are:
- \(F(r) = G(r) = 1 - \exp(-\lambda \pi r^2)\)
- \(K(r) = \pi r^2\)
where \(\lambda\) is the domain average spatial intensity and \(\exp()\) is the exponential function.
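These theoretical curves are easy to evaluate directly. Here is a minimal base-R sketch; the intensity value is illustrative, chosen to match 71 events in a 96 by 100 unit window (the Swedish pines data introduced below).

```r
# Theoretical CSR curves evaluated at a few lag distances
# (illustrative intensity: 71 events in a 96 x 100 unit window)
lambda <- 71 / (96 * 100)              # average spatial intensity
r <- c(0, 5, 10, 15, 20)               # lag distances (window units)

G_theo <- 1 - exp(-lambda * pi * r^2)  # F(r) = G(r) under CSR
K_theo <- pi * r^2                     # K(r) under CSR

round(data.frame(r, G_theo, K_theo), 3)
```

Both theoretical curves start at zero and increase monotonically with r; the empirical curves estimated below are judged against them.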
Recall the Swedish pine saplings data that comes with the {spatstat} package.
data(swedishpines)
class(swedishpines)## [1] "ppp"
Assign the data to an object called SP to reduce the amount of typing.
( SP <- swedishpines )## Planar point pattern: 71 points
## window: rectangle = [0, 96] x [0, 100] units (one unit = 0.1 metres)
The output indicates that there are 71 events within a rectangular window of 96 by 100 units, where one unit is 0.1 meter.
You obtain the values for the nearest neighbor function using the Gest() function from the {spatstat} package. Use the argument correction = "none" so no corrections are made for events near the window borders. Assign the output to a list object called G.
( G <- Gest(SP,
correction = "none") )## Function value object (class 'fv')
## for the function r -> G(r)
## ................................................
## Math.label Description
## r r distance argument r
## theo G[pois](r) theoretical Poisson G(r)
## raw hat(G)[raw](r) uncorrected estimate of G(r)
## ................................................
## Default plot formula: .~r
## where "." stands for 'raw', 'theo'
## Recommended range of argument r: [0, 22.26]
## Available range of argument r: [0, 22.26]
## Unit of length: 0.1 metres
The output includes the distance r, the raw uncorrected estimate of \(G(r)\) (empirical estimate) at various distances, and a theoretical estimate at those same distances based on a homogeneous Poisson process. Using the plot() method on the saved object G you compare the empirical estimates with the theoretical estimates. Here two horizontal lines are added to help with the interpretation.
plot(G)
abline(h = c(.2, .5),
col = "black",
lty = 2)
Values of G are on the vertical axis and values of distance (lag) are on the horizontal axis starting at 0. The black curve is the uncorrected estimate of \(G_{raw}(r)\) from the event locations and the red curve is \(G_{pois}(r)\) estimated from a homogeneous Poisson process with the same average intensity as the pine saplings.
The horizontal dashed line at G = .2 intersects the black line at a distance (r) of 5 units. This means that 20% of the events have another event within 5 units; that is, 20% of the saplings have another sapling within .5 meter.
Imagine placing a disc of radius 5 units around all 71 events then counting the number of events that have another event under the disc. That number divided by 71 is G(r).
To check this compute all pairwise distances with the pairdist() function.
PDmatrix <- pairdist(SP)
PDmatrix[1:6, 1:6]## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00000 27.000000 37.01351 15.03330 54.33231 25.298221
## [2,] 27.00000 0.000000 10.04988 12.04159 27.65863 8.544004
## [3,] 37.01351 10.049876 0.00000 22.00000 17.72005 14.764823
## [4,] 15.03330 12.041595 22.00000 0.00000 39.31921 11.401754
## [5,] 54.33231 27.658633 17.72005 39.31921 0.00000 30.066593
## [6,] 25.29822 8.544004 14.76482 11.40175 30.06659 0.000000
This creates a 71 x 71 square matrix of distances.
Sum the number of rows whose distances are within 5 units. Subtracting one removes the event itself from each row's count (an event location is not a neighbor of itself).
sum(rowSums(PDmatrix < 5) - 1) / nrow(PDmatrix) * 100## [1] 19.71831
Returning to the plot, the horizontal dashed line at G = .5 intersects the black line at .8 meters indicating that 50% of the pine saplings have another pine sapling within .8 meter.
You see that for a given radius the \(G_{raw}\) line is below the \(G_{pois}(r)\) line indicating that there are fewer pine saplings with another pine sapling in the vicinity than expected by chance.
For example, if the saplings were arranged under a model of CSR, you would expect 20% of the pairwise distances to be within .3 meter and 50% of them to be within .55 meter.
You make a better plot by first converting the object G to a data frame and then using {ggplot2} functions. Here you do this and then remove estimates for distances greater than 1.1 meter and convert the distance units to meters.
G.df <- as.data.frame(G) |>
dplyr::filter(r < 11) |>
dplyr::mutate(r = r * .1)
ggplot(data = G.df,
mapping = aes(x = r, y = raw)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
geom_hline(yintercept = c(.2, .5), lty = 'dashed') +
xlab("Lag distance (m)") + ylab("G(r): Cumulative % of events having another event within a distance r") +
theme_minimal()
Values for the empty space function are obtained from the Fest() function. Here you apply the Kaplan-Meier correction for edge effects with correction = "km". The function returns the percent of the domain within a distance r of any event.
Imagine again placing the disc, but this time on top of every point in the window and counting the number of points that have an event underneath.
Make a plot and add some lines to help with interpretation.
F.df <- SP |>
Fest(correction = "km") |>
as.data.frame() |>
dplyr::filter(r < 11) |>
dplyr::mutate(r = r * .1)
ggplot(data = F.df,
mapping = aes(x = r, y = km)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
geom_hline(yintercept = c(.7, .58), lty = 'dashed') +
geom_vline(xintercept = .61, lty = 2) +
xlab("Lag distance (m)") + ylab("Percent of domain within a distance r of an event") +
theme_minimal()
The horizontal dashed line at F = .7 intersects the black line at a distance of .61 meter. This means that 70% of the spatial domain is less than .61 meter from a sapling. The red line is the theoretical homogeneous Poisson process model. If the process were CSR, slightly less than 58% (F = .58) of the domain would be within .61 meter of a sapling. In words, the arrangement of saplings is less “gappy” (more regular) than expected by chance.
The J function combines the F and G functions as \(J(r) = \frac{1 - G(r)}{1 - F(r)}\). For a CSR process the value of J is one. Here we see a large and systematic departure of J from one for distances greater than about .5 meter, due to the regularity in the spacing of the saplings.
J.df <- SP |>
Jest() |>
as.data.frame() |>
dplyr::filter(r < 10) |>
dplyr::mutate(r = r * .1)
ggplot(data = J.df,
mapping = aes(x = r, y = km)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
xlab("Lag distance (m)") + ylab("") +
theme_minimal()
A commonly used distance function for assessing clustering in point pattern data is called Ripley’s K function. It is estimated with the Kest() function.
Mathematically it is defined as
\[ \hat K(r) = \frac{1}{\hat \lambda} \sum_{j \ne i} \frac{I(r_{ij} < r)}{n} \]
where \(r_{ij}\) is the Euclidean distance between event \(i\) and event \(j\), \(r\) is the search radius, and \(\hat \lambda\) is an estimate of the intensity \((\hat \lambda = n/|A|)\) where \(|A|\) is the window area and \(n\) is the number of events. \(I(.)\) is an indicator function equal to 1 when the expression \(r_{ij} < r\), and 0 otherwise. If the events are homogeneous, \(\hat{K}(r)\) increases at a rate proportional to \(\pi r^2\).
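The formula can be checked by hand without {spatstat}. The sketch below simulates a CSR-like pattern in a unit square using base R only and applies the sum directly; there is no edge correction, so the estimate is only roughly equal to \(\pi r^2\) (a toy illustration, not the Kest() implementation).

```r
# Hand-rolled (uncorrected) estimate of K(r) on a simulated pattern
set.seed(1)
n <- 500
xy <- cbind(runif(n), runif(n))   # uniform events in the unit square
lambda.hat <- n / 1               # intensity: n events per unit area

r <- 0.05
PD <- as.matrix(dist(xy))         # n x n pairwise Euclidean distances
# sum over pairs i != j of I(r_ij < r): subtract the n zero diagonal terms
K.hat <- (sum(PD < r) - n) / n / lambda.hat
c(K.hat = K.hat, K.theo = pi * r^2)
```

The estimate tends to fall somewhat below \(\pi r^2\) because events near the window boundary have part of their disc outside the window; the correction arguments to Kest() address exactly this.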
Thursday October 27, 2022
“Good code is its own best documentation. As you’re about to add a comment, ask yourself, ‘How can I improve the code so that this comment isn’t needed?’ Improve the code and then document it to make it even clearer.” - Steve McConnell
Today
- Examples of spatially clustered events
- Determining the statistical significance of event clustering
- Estimating event clustering in multi-type event locations
- More about the Ripley K function
Examples of spatially clustered events
Bramble canes
The locations of bramble canes are available as a marked ppp object in the {spatstat} package. A bramble is a rough (usually wild) tangled prickly shrub with thorny stems.
suppressMessages(library(spatstat))
data(bramblecanes)
summary(bramblecanes)## Marked planar point pattern: 823 points
## Average intensity 823 points per square unit (one unit = 9 metres)
##
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 9 metres)
##
## Multitype:
## frequency proportion intensity
## 0 359 0.43620900 359
## 1 385 0.46780070 385
## 2 79 0.09599028 79
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 9 metres
The marks represent three different ages (as an ordered factor) for the bramble canes. The unit of length is 9 meters.
plot(bramblecanes) 
Consider the point pattern for all the bramble canes regardless of age, estimate the \(K\) function, and make a corresponding plot. Plot the empirical estimate of \(K\) with an ‘isotropic’ correction at the domain borders (iso). Include a line for the theoretical \(K\) under the assumption of CSR.
K.df <- bramblecanes |>
Kest() |>
as.data.frame() |>
dplyr::mutate(r = r * 9)
library(ggplot2)
ggplot(data = K.df,
mapping = aes(x = r, y = iso)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
xlab("Lag distance (m)") + ylab("K(r)") +
theme_minimal()
The \(K\) estimate from the actual data (black line) lies above the theoretical \(K\) under CSR (red line). This means that at any lag distance there tend to be more events within that distance (larger \(K\)) than expected under CSR. You conclude that the bramble canes are more clustered than CSR.
Because the window has unit area, multiplying \(K(r)\) by the total number of events (823) gives the expected number of additional events. A value of \(K(r)\) = .1, which the red curve reaches at a lag distance of 1.6 meters, corresponds to about 82 additional events within that distance of a random event.
Kansas tornado reports
Previously you mapped the intensity of tornadoes across Kansas using the start locations as point pattern data. Here we return to these data and consider only tornadoes since 1994.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 3082) |>
dplyr::filter(mag >= 0, yr >= 1994) |>
dplyr::mutate(EF = as.factor(mag)) |>
dplyr::select(EF)## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
T.ppp <- Torn.sf["EF"] |>
as.ppp()
KS.sf <- USAboundaries::us_states(states = "Kansas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf)$proj4string)
W <- KS.sf |>
as.owin()
T.ppp <- T.ppp[W] |>
spatstat.geom::rescale(s = 1000,
unitname = "km")
T.ppp |>
plot()
T.ppp |>
summary()## Marked planar point pattern: 2241 points
## Average intensity 0.01038475 points per square km
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 4 decimal places
##
## Multitype:
## frequency proportion intensity
## 0 1623 0.7242303000 7.520953e-03
## 1 436 0.1945560000 2.020416e-03
## 2 104 0.0464078500 4.819342e-04
## 3 64 0.0285586800 2.965749e-04
## 4 13 0.0058009820 6.024177e-05
## 5 1 0.0004462294 4.633982e-06
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [1317.6759, 1980.2948] x [7114.969, 7458.57] km
## (662.6 x 343.6 km)
## Window area = 215797 square km
## Unit of length: 1 km
## Fraction of frame area: 0.948
There are 2241 events with an average intensity of about .01 events per square km (roughly one tornado per 100 square km over the 1994–2020 period).
You compare the \(K\) function estimated from the set of tornado reports with a theoretical \(K\) function from a model of CSR.
K.df <- T.ppp |>
Kest(correction = "iso") |>
as.data.frame() |>
dplyr::mutate(Kdata = iso * sum(intensity(T.ppp)),
Kpois = theo * sum(intensity(T.ppp)))
ggplot(data = K.df,
mapping = aes(x = r, y = Kdata)) +
geom_line() +
geom_line(mapping = aes(y = Kpois), color = "red") +
geom_vline(xintercept = 60, lty = 'dashed') +
geom_hline(yintercept = 129, lty = 'dashed') +
geom_hline(yintercept = 115, lty = 'dashed') +
xlab("Lag distance (km)") + ylab("K(r), Expected number of additional tornadoes\n within a distance r of any tornado") +
theme_minimal()
Consider the lag distance of 60 km along the horizontal axis. If you draw a vertical line at that distance it intersects the black curve at a height of about 129. This value indicates that at a distance of 60 km from a random tornado report about 129 other tornado reports are in the vicinity (on average).
Imagine placing a disc with radius 60 km centered on each event and then averaging the number of events under the disc over all events.
The red line is the theoretical curve under the assumption that the tornado reports are CSR across the state. If this is the case then you would expect to see about 115 tornadoes within a distance 60 km from any tornado (on average). Since there are MORE tornadoes than expected within a given 60 km radius you conclude that there is evidence for clustering (at this spatial scale).
The black line lies above the red line across distances from 0 to greater than 100 km.
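The CSR expectation quoted above follows directly from \(\lambda \pi r^2\). A quick sketch using the event count and window area from the summary output (numbers rounded from that output):

```r
# Expected number of additional tornadoes within 60 km under CSR
lambda <- 2241 / 215797        # events per square km (from summary())
lambda * pi * 60^2             # roughly 117
```

This is close to the value of about 115 read off the red curve in the plot.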
How do you interpret the results of applying the nearest neighbor function to these data? Here you create a data frame from the output of the Gest() function and remove distances exceeding 8 km.
G.df <- T.ppp |>
Gest(correction = "km") |>
as.data.frame() |>
dplyr::filter(r < 8)
ggplot(data = G.df,
mapping = aes(x = r, y = km)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
geom_hline(yintercept = .4, lty = 'dashed') +
geom_vline(xintercept = c(3.2, 4), lty = 'dashed') +
xlab("Lag distance (km)") + ylab("G(r): Cumulative % of tornadoes\n within a distance r of another tornado") +
theme_minimal()
The interpretation is that 40% (\(G\) = .4) of all tornado reports have another report within a distance of about 3.2 km on average. If the reports were homogeneous Poisson (CSR) then the distance would be 4 km. We conclude they are more clustered.
Note: With a data set containing many events the difference between the raw and border-corrected estimates of the distance functions is typically small.
Determining the statistical significance of event clustering
The plots show a separation between the black solid line and the red line, but is this separation large relative to sampling variation? Is the above difference between the empirical and theoretical distance functions (e.g., \(G\)) large enough to conclude there is significant clustering?
There are two ways to approach statistical inference. 1) Compare the function computed on the observed data against the function computed on data generated under the null hypothesis and ask: does the function fall outside the envelope of functions from the null cases? 2) Get estimates of uncertainty on the function and ask: does the uncertainty interval contain the null case?
With the first approach you take a ppp object and then compute the function of interest (e.g., Ripley’s K) for a specified number of samples under the null hypothesis of a homogeneous Poisson process.
To make things run faster you consider a subset of all the tornadoes (those with an EF rating of 2 or higher). You create a new ppp object that contains only tornadoes rated at least EF2. Because the marks form an unordered factor you can't subset with >=.
ST.ppp <- unmark(T.ppp[T.ppp$marks == 2 |
T.ppp$marks == 3 |
T.ppp$marks == 4 |
T.ppp$marks == 5])
plot(ST.ppp)
The envelope() method from the {spatstat} package is used on this new ST.ppp object. You specify the function with the fun = Kest argument and the number of samples with the nsim = argument. You then convert the output to a data frame. It takes a few seconds to complete the computation of \(K\) for all 99 samples.
Kenv.df <- envelope(ST.ppp,
fun = Kest,
nsim = 99) |>
as.data.frame()## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
head(Kenv.df)## r obs theo lo hi
## 1 0.0000000 0.00000 0.00000000 0 0.00000
## 2 0.1677738 13.10164 0.08842974 0 11.40878
## 3 0.3355477 13.10164 0.35371897 0 11.40878
## 4 0.5033215 13.10164 0.79586769 0 12.54270
## 5 0.6710954 13.10164 1.41487589 0 13.54574
## 6 0.8388692 13.10164 2.21074358 0 24.29327
The resulting data frame contains estimates of Ripley’s \(K\) as a function of lag distance (r) (column labeled obs). It also has the estimates of \(K\) under the null hypothesis of CSR (theo) and the lowest (lo) and highest (hi) values of \(K\) across the 99 samples.
You plot this information using the geom_ribbon() layer to include a gray ribbon around the model of CSR.
ggplot(data = Kenv.df,
mapping = aes(x = r, y = obs * intensity(ST.ppp))) +
geom_ribbon(mapping = aes(ymin = lo * intensity(ST.ppp),
ymax = hi * intensity(ST.ppp)), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo * intensity(ST.ppp)), color = "red") +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
The \(K\) function computed on the data is the black line and the \(K\) function under CSR is the red line. The uncertainty ribbon (gray band) connects the point-wise minimum and maximum values of \(K\) computed from the 99 generated point pattern samples.
Since the black line lies outside the gray band you can confidently conclude that the tornado reports are more clustered than one would expect by chance.
If the specific intention is to test a null hypothesis of CSR, then a single statistic indicating the departure of \(K\) computed on the observations from the theoretical \(K\) is appropriate.
One such statistic is the maximum absolute deviation (MAD) and is implemented with the mad.test() function from the {spatstat} package. The function performs a hypothesis test for goodness-of-fit of the observations to the theoretical model. The larger the value of the statistic, the less likely it is that the data are CSR.
mad.test(ST.ppp,
fun = Kest,
nsim = 99)## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
##
## Maximum absolute deviation test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: K(r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 85.9002061907829] km
## Test statistic: Maximum absolute deviation
## Deviation = observed minus theoretical
##
## data: ST.ppp
## mad = 7297.2, rank = 1, p-value = 0.01
The maximum absolute deviation is 7297 which is very large so the \(p\)-value is small and you reject the null hypothesis of CSR for these data. This is consistent with the graph. Note: Since there are 99 simulations the lowest \(p\)-value is .01.
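The floor on the \(p\)-value comes from how a Monte Carlo \(p\)-value is computed: the rank of the observed statistic among the observed-plus-simulated values, divided by the number of values. A one-line sketch:

```r
# Monte Carlo p-value: rank of the observed statistic among the
# (nsim + 1) values; with 99 simulations the smallest p is 1/100
nsim <- 99
rnk <- 1                  # observed statistic is the most extreme
rnk / (nsim + 1)          # 0.01
```

Increasing nsim lowers the attainable floor (e.g., 999 simulations allow a \(p\)-value of .001) at the cost of more computation.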
Another test statistic is related to the sum of the squared deviations between the estimated and theoretical functions. It is implemented with the dclf.test() function.
dclf.test(ST.ppp,
fun = Kest,
nsim = 99)## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
##
## Diggle-Cressie-Loosmore-Ford test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: K(r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 85.9002061907829] km
## Test statistic: Integral of squared absolute deviation
## Deviation = observed minus theoretical
##
## data: ST.ppp
## u = 1548888704, rank = 1, p-value = 0.01
Again the \(p\)-value on the test statistic against the two-sided alternative is less than .01.
Compare these test results on tornado report clustering with test results on pine sapling clustering in the swedishpines data set.
SP <- swedishpines
Kenv.df <- envelope(SP,
fun = Kest,
nsim = 99) |>
as.data.frame()## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
ggplot(data = Kenv.df,
mapping = aes(x = r * .1, y = obs * intensity(SP))) +
geom_ribbon(aes(ymin = lo * intensity(SP),
ymax = hi * intensity(SP)),
fill = "gray70") +
geom_line() + geom_line(aes(y = theo * intensity(SP)),
color = "red") +
xlab("Lag distance (m)") +
ylab("K(r), Expected number of additional saplings\n within a distance r of a sapling") +
theme_minimal()
At short distances (closer than about 1 m) the black line is below the red line and outside the gray ribbon which you interpret to mean that there are fewer pine saplings near other pine saplings than would be expected by chance at this scale. This ‘regularity’ might be the result of competition among the saplings.
At larger distances the black line is close to the red line and inside the gray ribbon which you interpret to mean that, at this larger spatial scale, the distribution of pine saplings is indistinguishable from CSR.
Based on the fact that much of the black line is within the gray envelope you might anticipate that a formal test against the null hypothesis of CSR will likely fail to reject.
mad.test(SP,
fun = Kest,
nsim = 99)## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
##
## Maximum absolute deviation test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: K(r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 24] units (one unit = 0.1 metres)
## Test statistic: Maximum absolute deviation
## Deviation = observed minus theoretical
##
## data: SP
## mad = 150.69, rank = 25, p-value = 0.25
dclf.test(SP,
fun = Kest,
nsim = 99)## Generating 99 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.
##
## Diggle-Cressie-Loosmore-Ford test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: K(r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 24] units (one unit = 0.1 metres)
## Test statistic: Integral of squared absolute deviation
## Deviation = observed minus theoretical
##
## data: SP
## u = 106917, rank = 17, p-value = 0.17
Both return a \(p\)-value that is greater than .15 so you fail to reject the null hypothesis of CSR.
In the second approach to inference the procedure of re-sampling is used. Note the distinction: Re-sampling refers to generating samples from the data while sampling, as above, refers to generating samples from some theoretical model.
The bootstrap procedure is a re-sampling strategy whereby new samples are generated from the data by randomly choosing events within the domain. An event that is chosen for the ‘bootstrap’ sample gets the chance to be chosen again (called ‘with replacement’). The number of events in each bootstrap sample must equal the number of events in the data.
Consider 15 numbers from 1 to 15. Then pick randomly from that set of numbers with replacement until the sample size is 15 to create a bootstrap sample.
( x <- 1:15 )## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
sample(x, replace = TRUE)## [1] 7 3 4 11 1 10 9 8 3 11 5 11 12 10 15
Some numbers get picked more than once and some do not get picked at all.
The average of the original 15 x values is 8 but the average over the set of numbers in the bootstrap sample will not necessarily be 8. However, the distribution of the averages over many bootstrap samples will be centered close to this average.
mx <- NULL
for(i in 1:99){
mx[i] <- mean(sample(x, replace = TRUE))
}
mx.df <- as.data.frame(mx)
ggplot(data = mx.df,
mapping = aes(mx)) +
geom_density() +
geom_vline(xintercept = mean(x),
color = "red")
The important thing is that the bootstrap distribution provides an estimate of the uncertainty on the computed mean through the range of possible average values.
In this way, the lohboot() function estimates the uncertainty on the computed spatial statistic using a bootstrap procedure. It works by computing a local version of the function (e.g., localK()) on the set of re-sampled events.
Kboot.df <- ST.ppp |>
lohboot(fun = Kest) |>
as.data.frame()## 1, 2, 3, 4.6.8.10.12.14.16.18.20.22.24.26.28.30.32.34.36.38.40
## .42.44.46.48.50.52.54.56.58.60.62.64.66.68.70.72.74.76.78.80
## .82.84.86.88.90.92.94.96.98.100.102.104.106.108.110.112.114.116.118.120
## .122.124.126.128.130.132.134.136.138.140.142.144.146.148.150.152.154.156.158.160
## .162.164.166.168.170.172.174.176.178.180. 182.
ggplot(data = Kboot.df,
mapping = aes(x = r, y = iso * intensity(ST.ppp))) +
geom_ribbon(aes(ymin = lo * intensity(ST.ppp),
ymax = hi * intensity(ST.ppp)), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo * intensity(ST.ppp)), color = "red") +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
Now the uncertainty band is plotted about the black line (\(K\) function computed on the observations) rather than about the null model (red line). The 95% uncertainty band does not include the CSR model so you confidently conclude that the tornadoes in Kansas are more clustered than expected by chance.
Repeating for the Swedish pine saplings.
Kboot.df <- SP |>
lohboot(fun = Kest) |>
as.data.frame()## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71.
ggplot(Kboot.df, aes(x = r * .1, y = iso * intensity(SP))) +
geom_ribbon(aes(ymin = lo * intensity(SP),
ymax = hi * intensity(SP)), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo * intensity(SP)), color = "blue", lty = 'dashed') +
xlab("Lag distance (m)") + ylab("K(r)") +
theme_minimal()
At short distances (closer than about 1.5 m) the gray ribbon is below the blue line which you interpret to mean that there are fewer pine saplings near other pine saplings than would be expected by chance at this scale indicating regularity.
Estimating event clustering in multi-type event locations
Often the interest focuses on whether the occurrence of one event type influences (or is influenced by) another event type. For example, does the occurrence of one species of tree influence the occurrence of another species?
Analogues to the \(G\) and \(K\) functions are available for ‘multi-type’ point patterns where the marks are factors.
A common statistic for examining ‘cross correlation’ of event type occurrences is the cross \(K\) function \(K_{ij}(r)\), which estimates the expected number of events of type \(j\) within a distance \(r\) of type \(i\).
Consider the data called lansing from the {spatstat} package that contains the locations of 2,251 trees of various species in a wooded lot in Lansing, MI as a ppp object.
data(lansing)
summary(lansing)## Marked planar point pattern: 2251 points
## Average intensity 2251 points per square unit (one unit = 924 feet)
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 924 feet)
##
## Multitype:
## frequency proportion intensity
## blackoak 135 0.05997335 135
## hickory 703 0.31230560 703
## maple 514 0.22834300 514
## misc 105 0.04664594 105
## redoak 346 0.15370950 346
## whiteoak 448 0.19902270 448
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 924 feet
The data are a multi-type planar point pattern with the marks indicating tree species. There are 135 black oaks, 703 hickories, etc. The spatial unit is 924 feet.
Compute and plot the cross \(K\) function for Maple and Hickory trees.
Kc.df <- lansing |>
Kcross(i = "maple",
j = "hickory") |>
as.data.frame()
ggplot(data = Kc.df,
mapping = aes(x = r, y = iso)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
geom_vline(xintercept = .2, lty = 'dashed') +
geom_hline(yintercept = .093, lty = 'dashed') +
geom_hline(yintercept = .125, lty = 'dashed') +
xlab("Distance") + ylab("Kc(r)") +
theme_minimal()
The vertical axis is the number of hickory trees within a radius r of a maple tree divided by the average intensity of the hickories. So at a distance of .2 units (.2 x 924 ft ≈ 185 ft) from a random maple there is an average of roughly 65 hickories (.093 x 703 hickories). If hickory and maple trees were CSR you would expect about 88 hickories (.125 x 703) within that distance.
Because fewer hickories than expected are found near maples, the presence of a maple tree reduces the likelihood that a hickory tree will be nearby (the two species tend to segregate).
Do the same for the EF1 and EF3 tornadoes in Kansas.
plot(Kcross(T.ppp,
i = "1",
j = "3"))
abline(v = 70)
abline(h = 18700)
abline(h = 15500)
Kc.df <- T.ppp |>
Kcross(i = "1",
j = "3") |>
as.data.frame()
ggplot(data = Kc.df,
mapping = aes(x = r, y = iso)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
geom_vline(xintercept = 70, lty = 'dashed') +
geom_hline(yintercept = 18700, lty = 'dashed') +
geom_hline(yintercept = 15500, lty = 'dashed') +
xlab("Distance") + ylab("Kc(r)") +
theme_minimal()
The vertical axis is the number of EF3 tornadoes within a radius r of an EF1 tornado divided by the average intensity of the EF3 tornadoes. At a distance of 70 km from a random EF1 tornado there are on average 18700 x .000297 = 5.5 EF3 tornadoes. If EF1 and EF3 tornadoes were CSR then you would expect, on average, somewhat fewer EF3 tornadoes in the vicinity of EF1 tornadoes (15500 x .000297 = 4.6).
You can see this more clearly by using the envelope() function with fun = Kcross. You first use the subset() method with drop = TRUE to make a new ppp object with only those two groups.
T.ppp13 <- subset(T.ppp,
marks == "1" |
marks == "3",
drop = TRUE)
Kcenv.df <- T.ppp13 |>
envelope(fun = Kcross,
nsim = 99) |>
as.data.frame()## Generating 99 simulations of CSR ...
## Done.
ggplot(data = Kcenv.df,
mapping = aes(x = r, y = obs)) +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo), color = "red", lty = 'dashed') +
xlab("Lag distance (km)") + ylab("Kc(r)") +
theme_minimal()
And you can formally test as before using the mad.test() function.
mad.test(T.ppp13, fun = Kcross, nsim = 99)## Generating 99 simulations of CSR ...
## Done.
##
## Maximum absolute deviation test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: "K"["1", "3"](r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 85.9002061907829] km
## Test statistic: Maximum absolute deviation
## Deviation = observed minus theoretical
##
## data: T.ppp13
## mad = 4234.6, rank = 1, p-value = 0.01
dclf.test(T.ppp13, fun = Kcross, nsim = 99)## Generating 99 simulations of CSR ...
## Done.
##
## Diggle-Cressie-Loosmore-Ford test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: "K"["1", "3"](r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 85.9002061907829] km
## Test statistic: Integral of squared absolute deviation
## Deviation = observed minus theoretical
##
## data: T.ppp13
## u = 402609991, rank = 1, p-value = 0.01
Both tests lead you to conclude that EF3 tornadoes are more likely near EF1 tornadoes than would be expected if they were independently CSR.
More about the Ripley K function
Compute Ripley \(K\) and look at the classes of the resulting object.
K <- Kest(T.ppp)
class(K)## [1] "fv" "data.frame"
It has two classes: fv and data.frame. It is a data frame but with additional attribute information. You focus on the data frame portion.
K.df <- as.data.frame(K)
head(K.df)## r theo border trans iso
## 1 0.0000000 0.00000000 0.000000 0.000000 0.000000
## 2 0.1677738 0.08842974 5.586056 5.588550 5.588550
## 3 0.3355477 0.35371897 6.118072 6.104416 6.104416
## 4 0.5033215 0.79586769 6.745407 6.792237 6.792237
## 5 0.6710954 1.41487589 7.264284 7.308103 7.308103
## 6 0.8388692 2.21074358 7.963272 7.995925 7.995925
In particular you want the values of r and iso. The value of iso times the average spatial intensity gives the expected number of additional tornadoes within a distance r of a typical tornado.
You add this information to the data frame.
K.df <- K.df |>
dplyr::mutate(nT = summary(T.ppp)$intensity * iso)
Suppose you are interested in the average number of tornadoes within a distance of exactly 50 km. Use the approx() function to interpolate the value of nT at a distance of 50 km.
approx(x = K.df$r,
y = K.df$nT,
xout = 50)$y## [1] 92.39087
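approx() performs piecewise-linear interpolation between the supplied (x, y) pairs; a minimal sketch with made-up values shows the idea:

```r
# Linear interpolation with approx(): xout = 37.5 falls halfway
# between the knots at 25 and 50, so y is halfway between 40 and 90.
x <- c(0, 25, 50, 100)
y <- c(0, 40, 90, 200)
approx(x = x, y = y, xout = 37.5)$y   # 65
```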
Finally, the variance stabilized Ripley \(K\) function called the \(L\) function is often used instead of \(K\). The sample version of the \(L\) function is defined as \[ \hat{L}(r) = \Big( \hat{K}(r)/\pi\Big)^{1/2}. \]
For data that is CSR, the \(L\) function has expected value \(r\) and its variance is approximately constant in \(r\). A common plot is a graph of \(r - \hat{L}(r)\) against \(r\), which approximately follows the horizontal zero-axis with constant dispersion if the data follow a homogeneous Poisson process.
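The definition is easy to verify numerically: under CSR, \(K(r) = \pi r^2\), so the transform returns \(L(r) = r\) exactly (base R sketch):

```r
# Under CSR K(r) = pi r^2, so L(r) = sqrt(K(r)/pi) = r and r - L(r) = 0.
r <- seq(0, 1, by = 0.1)
K_csr <- pi * r^2
L_csr <- sqrt(K_csr / pi)
all.equal(L_csr, r)   # TRUE
```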
Tuesday November 1, 2022
“Weeks of coding can save you hours of planning.” - Unknown
Today
- Inferring event interaction from distance functions
- Removing duplicate event locations and defining the domain
- Modeling point pattern data
- Fitting and interpreting an inhibition model
Inferring event interaction from distance functions
The distance functions (\(G\), \(K\), etc) that are used to quantify clustering are defined and estimated under the assumption that the process that produced the events is stationary (homogeneous). If this is true then you can treat any sub-region of the domain as an independent and identically distributed (iid) sample from the entire set of data.
If the spatial distribution of the event locations is influenced by event interaction then the functions will deviate from the theoretical model of CSR. But a deviation from CSR does not imply event interaction.
Moreover, the functions characterize the spatial arrangement of event locations ‘on average’ so variability in an interaction as a function of scale may not be detected.
As an example of the latter case, here you generate event locations at random with clustering on a small scale but with regularity on a larger scale. On average the event locations are CSR as indicated by the \(K\) function.
suppressMessages(library(spatstat))
set.seed(0112)
X <- rcell(nx = 15)
plot(X, main = "")
There are two ‘local’ clusters, one in the north and one in the south. But overall the events appear to be more regular (inhibition) than CSR.
Interpretation of the process that created the event locations based on Ripley’s \(K\) would be that the arrangement of events is CSR.
library(ggplot2)
K.df <- X |>
Kest() |>
as.data.frame()
ggplot(K.df, aes(x = r, y = iso)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
The empirical curve (black line) coincides with the theoretical CSR line (red line) indicating CSR.
And the maximum absolute deviation test under the null hypothesis of CSR returns a large \(p\)-value so you fail to reject it.
mad.test(X, fun = Kest, nsim = 99)## Generating 99 simulations of CSR ...
## Done.
##
## Maximum absolute deviation test of CSR
## Monte Carlo test based on 99 simulations
## Summary function: K(r)
## Reference function: theoretical
## Alternative: two.sided
## Interval of distance values: [0, 0.25]
## Test statistic: Maximum absolute deviation
## Deviation = observed minus theoretical
##
## data: X
## mad = 0.0023931, rank = 87, p-value = 0.87
As an example of the former case, here you generate event locations that have no inter-event interaction but there is a trend in the spatial intensity.
X <- rpoispp(function(x, y){ 300 * exp(-3 * x) })
plot(X, main = "") 
By design there is a clear trend toward fewer events moving toward the east.
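A pattern with this intensity can also be generated by hand using independent thinning (the Lewis-Shedler method); the base R sketch below is illustrative and is not a claim about how rpoispp() works internally:

```r
# Generate an inhomogeneous Poisson pattern on the unit square by
# thinning a homogeneous proposal with rate lam_max = max(lambda).
set.seed(1)
lambda <- function(x, y) 300 * exp(-3 * x)
lam_max <- 300

n <- rpois(1, lam_max)                      # homogeneous proposals
x <- runif(n); y <- runif(n)
keep <- runif(n) < lambda(x, y) / lam_max   # retain with prob lambda/lam_max
x <- x[keep]; y <- y[keep]

# the expected number of retained events is the integral of lambda,
# here 100 * (1 - exp(-3)), about 95
length(x)
```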
You compute and plot the \(K\) function on these event locations.
K.df <- X |>
Kest() |>
as.data.frame()
ggplot(K.df, aes(x = r, y = iso)) +
geom_line() +
geom_line(aes(y = theo), color = "red") +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
The \(K\) function indicates clustering but this is an artifact of the trend in the intensity.
In the case of a known trend in the spatial intensity, you need to use the Kinhom() function. For example, compare the uncertainty envelopes from a homogeneous and inhomogeneous Poisson process.
Start by plotting the output from the envelope() function with fun = Kest. The global = TRUE argument makes the envelopes simultaneous rather than point-wise (the default, global = FALSE). Point-wise envelopes assume the estimates are independent across the range of distances (usually not a good assumption), so the standard errors are smaller, resulting in narrower bands.
Kenv <- envelope(X,
fun = Kest,
nsim = 39,
rank = 1,
global = TRUE)## Generating 39 simulations of CSR ...
## Done.
Kenv.df <- as.data.frame(Kenv)
ggplot(Kenv.df, aes(x = r, y = obs)) +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo), color = "red", lty = 'dashed') +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
After a distance of about .15 units the empirical curve (black line) is outside the uncertainty band indicating the events are more clustered than CSR.
However, when you use fun = Kinhom the empirical curve is completely inside the uncertainty band.
Kenv <- envelope(X,
fun = Kinhom,
nsim = 99,
rank = 1,
global = TRUE)## Generating 99 simulations of CSR ...
## Done.
Kenv.df <- as.data.frame(Kenv)
ggplot(Kenv.df, aes(x = r, y = obs)) +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo), color = "red", lty = 'dashed') +
xlab("Lag distance (km)") + ylab("K(r), Expected number of additional events\n within a distance r of an event") +
theme_minimal()
You conclude that the point pattern data are consistent with an inhomogeneous Poisson process without event interaction.
Let’s return to the Kansas tornadoes (EF1+). You import the data and create a point pattern object windowed by the state borders.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 3082) |>
dplyr::filter(mag >= 1, yr >= 1994) |>
dplyr::mutate(EF = as.factor(mag)) |>
dplyr::select(EF)## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
ST.ppp <- Torn.sf["EF"] |>
as.ppp()
KS.sf <- USAboundaries::us_states(states = "Kansas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf)$proj4string)
W <- KS.sf |>
as.owin()
ST.ppp <- ST.ppp[W] |>
spatstat.geom::rescale(s = 1000,
unitname = "km")
plot(ST.ppp)
There are more tornado reports in the west than in the east, especially across the southern part of the state, indicating the process producing the events is not homogeneous. This means there are other factors contributing to the local event intensity.
Evidence for clustering must account for this inhomogeneity. Here you do this by computing the envelope around the inhomogeneous Ripley K function using the argument fun = Kinhom.
Kenv <- envelope(ST.ppp,
fun = Kinhom,
nsim = 39,
rank = 1,
global = TRUE)## Generating 39 simulations of CSR ...
## Done.
Kenv.df <- as.data.frame(Kenv)
ggplot(Kenv.df, aes(x = r, y = obs)) +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo), color = "red", lty = 'dashed') +
xlab("Lag distance (km)") + ylab("K(r)") +
theme_minimal()
The output reveals no evidence of clustering at distances less than about 70 km. At greater distances there is evidence of regularity, indicated by the black line falling significantly below the red line. This is because tornado reports are more common near cities and towns, and cities and towns tend to be spread out more regularly than CSR.
Removing duplicate event locations and defining the domain
The functions in the {spatstat} package require the event locations (as a ppp object) and a domain over which the spatial statistics are computed (as an owin object).
If no owin object is specified, the statistics are computed over a rectangle (bounding box) defined by the northernmost, southernmost, easternmost, and westernmost event locations.
To see this, consider the Florida wildfire data as a simple feature data frame. Extract only fires occurring in Baker County (west of Duval County–Jacksonville). Include only wildfires started by lightning and select the fire size variable.
FL_Fires.sf <- sf::st_read(dsn = here::here("data", "FL_Fires")) |>
sf::st_transform(crs = 3086)## Reading layer `FL_Fires' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/FL_Fires'
## using driver `ESRI Shapefile'
## Simple feature collection with 90261 features and 37 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -9750382 ymin: 2824449 xmax: -8908899 ymax: 3632749
## Projected CRS: Mercator_2SP
Baker.sf <- USAboundaries::us_counties(states = "FL") |>
dplyr::select(name) |>
dplyr::filter(name == "Baker") |>
sf::st_transform(crs = 3086)
BakerFires.sf <- FL_Fires.sf |>
sf::st_intersection(Baker.sf) |>
dplyr::filter(STAT_CAU_1 == "Lightning") |>
dplyr::select(FIRE_SIZE_)## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
Create a ppp object and an unmarked ppp object. Summarize the unmarked object and make a plot.
BF.ppp <- BakerFires.sf |>
as.ppp()
BFU.ppp <- unmark(BF.ppp)
summary(BFU.ppp)## Planar point pattern: 327 points
## Average intensity 1.797954e-07 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 units
##
## Window: rectangle = [547988.2, 587567.5] x [682872.2, 728823.8] units
## (39580 x 45950 units)
## Window area = 1818730000 square units
plot(BFU.ppp)
The average intensity is about 18 wildfires per 100 square km. But the intensity is based on a rectangular (bounding box) domain. The lack of events in the northeast part of the domain is because you removed wildfires outside the county border.
Further, two event locations are duplicates if their x,y coordinates are identical and, if the events carry marks, their marks also match.
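In base R the same rule can be expressed with duplicated() on the coordinate pairs (a sketch with toy coordinates; the unique() method for ppp objects applies the analogous logic):

```r
# Two unmarked events are duplicates when both coordinates coincide.
xy <- data.frame(x = c(1, 2, 1, 3),
                 y = c(5, 6, 5, 7))
duplicated(xy)          # the third row repeats the first
xy[!duplicated(xy), ]   # keep the first occurrence of each location
```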
Remove duplicate events with the unique() function, set the domain to be the county border, and set the name for the unit of length to meters.
BFU.ppp <- unique(BFU.ppp)
W <- Baker.sf |>
as.owin()
BFU.ppp <- BFU.ppp[W]
unitname(BFU.ppp) <- "meters"
summary(BFU.ppp)## Planar point pattern: 322 points
## Average intensity 2.096214e-07 points per square meters
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 meters
##
## Window: polygonal boundary
## single connected closed polygon with 17 vertices
## enclosing rectangle: [547588.2, 587682.5] x [681954.6, 731650.3] meters
## (40090 x 49700 meters)
## Window area = 1536100000 square meters
## Unit of length: 1 meters
## Fraction of frame area: 0.771
plot(BFU.ppp)
Now the average intensity is about 21 wildfires per 100 square km.
Apply Ripley’s \(K\) function and graph the results.
K.df <- BFU.ppp |>
Kest() |>
as.data.frame()
ggplot(K.df, aes(x = r, y = iso * intensity(BFU.ppp))) +
geom_line() +
geom_line(aes(y = theo * intensity(BFU.ppp)), color = "red") +
xlab("Lag distance (m)") + ylab("K(r), Expected number of additional wildfires\n within a distance r of any wildfire") +
theme_minimal()
You see a difference indicating clustering of event locations, but is the difference significant against a null hypothesis of a homogeneous Poisson process?
Kenv.df <- envelope(BFU.ppp,
fun = Kest,
nsim = 39,
rank = 1,
global = TRUE) |>
as.data.frame()## Generating 39 simulations of CSR ...
## Done.
ggplot(Kenv.df, aes(x = r, y = obs)) +
geom_ribbon(aes(ymin = lo, ymax = hi), fill = "gray70") +
geom_line() +
geom_line(aes(y = theo), color = "red", lty = 'dashed') +
xlab("Lag distance (m)") + ylab("K(r)") +
theme_minimal()
Yes it is.
Modeling point pattern data
Models are helpful for trying to understand the processes leading to the event locations when event interaction is suspected. Event interaction means that an event at one location changes the probability of an event nearby.
Cluster models can be derived by starting with a Poisson model. For example, you begin with a homogeneous Poisson model \(Y\) describing a set of events. A model is homogeneous Poisson when the event locations generated by the model are CSR.
Then consider each individual event \(y_i\) in \(Y\) to be a ‘parent’ that produces a set of ‘offspring’ events (\(x_i\)) according to some mechanism. The resulting set of offspring forms clustered point pattern data \(X\). Said another way, the model is homogeneous Poisson at an unobserved level \(Y\) (latent level) but clustered at the level of the observations (\(X\)).
One example of this parent-child process is the Matern cluster model. Parent events come from a homogeneous Poisson process with intensity \(\kappa\) and then each parent has a Poisson (\(\mu\)) number of offspring that are iid within a radius \(r\) centered on the parent.
For instance, here you use the rMatClust() function from the {spatstat} package to produce a clustered ppp object. You use a parent intensity of 10 (kappa = 10), a disc radius of .1 units, and an offspring rate equal to 5 (mu = 5).
rMatClust(kappa = 10,
r = .1,
mu = 5) |>
plot(main = "")
The result is a set of event locations and the process that produced them is described as doubly Poisson. You can vary \(\kappa\), \(r\), and \(\mu\) to generate more or fewer events.
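The parent-offspring mechanism is simple enough to sketch in base R (illustrative only; rMatClust() also handles edge effects that this sketch ignores):

```r
# Matern cluster sketch: Poisson(kappa) parents in the unit square,
# each with a Poisson(mu) number of offspring placed uniformly in a
# disc of radius r about the parent.
set.seed(42)
kappa <- 10; r <- 0.1; mu <- 5

n_par <- rpois(1, kappa)
px <- runif(n_par); py <- runif(n_par)

n_off <- rpois(n_par, mu)                 # offspring count per parent
theta <- runif(sum(n_off), 0, 2 * pi)     # uniform angle
rad <- r * sqrt(runif(sum(n_off)))        # sqrt gives uniform in the disc
x <- rep(px, n_off) + rad * cos(theta)
y <- rep(py, n_off) + rad * sin(theta)

kappa * mu    # expected number of offspring events: 50
length(x)     # realized number for this seed
```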
Other clustered Poisson models include:
- Thomas model: each cluster consists of a Poisson number of random events with each event having an isotropic Gaussian displacement from its parent.
- Gauss-Poisson model: each cluster is either a single event or a pair of events.
- Neyman-Scott model: the cluster mechanism is arbitrary.
A Cox model is a homogeneous Poisson model with a random intensity function. Let \(\Lambda(s)\) be a function with non-negative values defined at all locations \(s\) inside the domain. Conditional on \(\Lambda\), let \(X\) be a Poisson model with intensity function \(\Lambda\); then \(X\) is a sample from a Cox model.
An example of a Cox model is the mixed Poisson process in which a random variable \(\Lambda\) is generated and then, conditional on \(\Lambda\), a homogeneous Poisson process with intensity \(\Lambda\) is generated.
Following are two samples from a Cox point process.
set.seed(3042)
par(mfrow = c(1, 2))
for (i in 1:2){
lambda <- rexp(n = 1, rate = 1/100)
X <- rpoispp(lambda)
plot(X)
}
par(mfrow = c(1, 1))
The statistical moments of Cox models are defined in terms of the moments of \(\Lambda\). For instance, the intensity function of \(X\) is \(\lambda(s)\) = E[\(\Lambda(s)\)], where E[] is the expected value.
Cox models are convenient for describing clustered point pattern data. A Cox model is over-dispersed relative to a Poisson model (i.e., the variance of the number of events falling in any region of size A is greater than the mean number of events in those regions). The Matern cluster model and the Thomas model are Cox models. Another common type of Cox model is the log-Gaussian Cox process (LGCP), in which the logarithm of \(\Lambda(s)\) is a Gaussian random function.
If you have a way of generating samples from a random function \(\Lambda\) of interest, then you can use the rpoispp() function to generate the Cox process. The intensity argument lambda of rpoispp() can be a function of x or y or a pixel image.
Another way to generate clustered point pattern data is by ‘thinning’. Thinning refers to deleting some of the events. With ‘independent thinning’ the fate of each event is independent of the fate of the other events. When independent thinning is applied to a homogeneous Poisson point pattern, the resulting point pattern of retained events is also Poisson. To simulate an inhibition process you can use a ‘thinning’ mechanism.
An example of this is Matern’s Model I. Here a homogeneous Poisson model first generates a point pattern \(Y\), then any event in \(Y\) that lies closer than a distance \(r\) from another event is deleted. This results in point pattern data in which close neighbor events do not exist.
X <- rMaternI(kappa = 70,
r = .05)
plot(X, main = "")
X |>
Kest() |>
plot()
Changing \(\kappa\) and \(r\) will change the event intensity.
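The Model I thinning rule can be written out in a few lines of base R (a sketch of the deletion rule, not the rMaternI() implementation). After thinning, no two surviving events are within r of each other:

```r
# Matern Model I by hand: homogeneous proposals, then delete every
# event that has a neighbour closer than r (both members of a close
# pair are deleted).
set.seed(7)
kappa <- 70; r <- 0.05
n <- rpois(1, kappa)
x <- runif(n); y <- runif(n)

d <- as.matrix(dist(cbind(x, y)))
diag(d) <- Inf                      # ignore self-distances
keep <- apply(d, 1, min) >= r       # survivors have no close neighbour
x <- x[keep]; y <- y[keep]

min(dist(cbind(x, y)))              # at least r by construction
```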
The various spatial models for event locations can be described mathematically. For instance, expanding on the earlier notation, you write that a homogeneous Poisson model with intensity \(\lambda > 0\) has conditional intensity \[\lambda(s, x) = \lambda\] where \(s\) is any location in the window W and \(x\) is the set of events.
Then the inhomogeneous Poisson model has conditional intensity \[\lambda(s, x) = \lambda(s)\]. The intensity \(\lambda(s)\) depends on a spatial trend or on an explanatory variable.
There is also a class of ‘Markov’ point process models that allow for clustering (or inhibition) due to event interaction. Markov refers to the fact that the interaction is limited to nearest neighbors. Said another way, a Markov point process generalizes a Poisson process in the case where events are pairwise dependent.
A Markov process with parameters \(\beta > 0\) and \(0 < \gamma < \infty\) with interaction radius \(r > 0\) has conditional intensity \(\lambda(s, x)\) given by
\[ \lambda(s, x) = \beta \gamma^{t(s, x)} \]
where \(t(s, x)\) is the number of events that lie within a distance \(r\) of location \(s\).
Three cases:
- If \(\gamma = 1\), then \(\lambda(s, x) = \beta\): no interaction between events; \(\beta\) can vary with \(s\).
- If \(\gamma < 1\), then \(\lambda(s, x) < \beta\): events inhibit nearby events.
- If \(\gamma > 1\), then \(\lambda(s, x) > \beta\): events encourage nearby events.
Note the distinction between the interaction term \(\gamma\) and the trend term \(\beta\). A similar distinction exists between the autocorrelation \(\rho\) and trend \(\beta\) terms in spatial regression models.
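The three cases are easy to see by coding the conditional intensity directly (a base R sketch with made-up numbers):

```r
# Strauss conditional intensity: lambda(s, x) = beta * gamma^t(s, x),
# where t(s, x) counts events of x within distance r of location s.
strauss_lambda <- function(s, x, beta, gamma, r) {
  t_sx <- sum(sqrt((x[, 1] - s[1])^2 + (x[, 2] - s[2])^2) <= r)
  beta * gamma^t_sx
}

x <- cbind(c(0.10, 0.50, 0.52),      # three events; two are close together
           c(0.10, 0.50, 0.50))
s <- c(0.51, 0.50)                   # location with two events within r

strauss_lambda(s, x, beta = 100, gamma = 1.0, r = 0.05)  # 100: no interaction
strauss_lambda(s, x, beta = 100, gamma = 0.5, r = 0.05)  # 25: inhibition
```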
More generally, you write the logarithm of the conditional intensity \(\log[\lambda(s, x)]\) as a linear expression with two components.
\[ \log\big[\lambda(s, x)\big] = \theta_1 B(s) + \theta_2 C(s, x) \]
where the \(\theta\)’s are model parameters that need to be estimated.
The term \(B(s)\) depends only on location so it represents trend and explanatory variable (covariate) effects. It is the ‘systematic component’ of the model. The term \(C(s, x)\) represents stochastic interactions (dependency) between events.
Fitting and interpreting an inhibition model
The {spatstat} package contains functions for fitting statistical models to point pattern data. Models can include trend (to account for non-stationarity), explanatory variables (covariates), and event interactions of any order (that is, interactions are not restricted to pairwise). Models are fit either with the method of maximum likelihood or the method of minimum contrasts.
The method of maximum likelihood estimates the probability of the empirical \(K\) curve given the theoretical curve for various parameter values. Parameter values are chosen so as to maximize the likelihood of the empirical curve.
The method of minimum contrasts derives a cost function as the difference between the theoretical and empirical \(K\) curves. Parameter values for the theoretical curve are those that minimize this cost function.
The ppm() function is used to fit a spatial point pattern model. The syntax has the form ppm(X, formula, interaction, ...) where X is the point pattern object of class ppp, formula describes the systematic (trend and covariate) part of the model, and interaction describes the stochastic dependence between events (e.g., Matern process).
Recall a plot of the Swedish pine saplings. There was no indication of a trend (no systematic variation in the intensity of saplings).
SP <- swedishpines
plot(SP)
intensity(SP)## [1] 0.007395833
There is no obvious spatial trend in the distribution of saplings and the average intensity is .0074 saplings per unit area.
A plot of the Ripley’s \(K\) function indicated regularity relative to CSR.
SP |>
Kest(correction = "iso") |>
plot()
The red dashed line is the \(K\) curve under CSR. The black line is the empirical curve. At lag distances between 5 and 15 units the empirical curve lies below the CSR curve, indicating there are fewer events within those distances of other events than would be expected by chance.
This suggests a physical process whereby saplings tend to compete for sunlight, nutrients, etc. A process of between-event inhibition. If you suspect that the spatial distribution of event locations is influenced by inhibition you can model the process statistically.
A simple inhibition model is the Strauss process, in which the inhibition is constant within a fixed radius (r) around each event. The amount of inhibition ranges from none (100% chance of a nearby event) to complete (0% chance of a nearby event). With no inhibition the process is equivalent to a homogeneous Poisson process.
If you assume the inhibition process is constant across the domain with a fixed interaction radius (r), then you can fit a Strauss model to the data. You use the ppm() function from the {spatstat} package and include the point pattern data as the first argument. You set the trend term to a constant (implying a stationary process) with the argument trend = ~ 1 and the interaction radius to 10 units with the argument interaction = Strauss(r = 10). Finally you use a border correction out to a distance of 10 units from the window with the rbord argument.
Save the output in the object called model.in (inhibition model).
model.in <- ppm(SP,
trend = ~ 1,
interaction = Strauss(r = 10),
rbord = 10)
The value for r in the Strauss() function is based on a visual inspection of the plot of Kest(). A value is chosen that represents the distance at which there is the largest departure from a CSR model.
You inspect the model parameters by typing the object name.
model.in## Stationary Strauss process
##
## First order term: beta = 0.07567442
##
## Interaction distance: 10
## Fitted interaction parameter gamma: 0.2752048
##
## Relevant coefficients:
## Interaction
## -1.29024
##
## For standard errors, type coef(summary(x))
The first-order term (beta) has a value of .0757. This is the intensity of the ‘proposal’ events. The value of beta exceeds the average intensity by a factor of ten.
Recall the intensity of the events is obtained as
intensity(SP)## [1] 0.007395833
The interaction parameter (gamma) is .275. It is less than one, indicating an inhibition process. The logarithm of gamma, called the interaction coefficient (Interaction), is -1.29. Interaction coefficients less than zero imply inhibition.
A table with the coefficients including the standard errors and uncertainty ranges is obtained with the coef() method.
model.in |>
summary() |>
coef()## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) -2.581315 0.4524077 -3.468018 -1.6946123 *** -5.705728
## Interaction -1.290240 0.2375515 -1.755832 -0.8246475 *** -5.431411
The output includes the Interaction coefficient along with its standard error (S.E.) and the associated 95% uncertainty interval. The ratio of the Interaction coefficient to its standard error is the Zval. A large z-value (in absolute magnitude) translates to a low \(p\)-value and a rejection of the null hypothesis of no interaction between events.
The output also includes the estimated value of the (Intercept) term. It is the logarithm of the beta value, so exp(-2.58) = .0757 is the intensity of the proposal events.
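The link between the printed coefficients and the reported beta and gamma is just exponentiation, which you can verify directly:

```r
# The printed coefficients are on the log scale; exponentiating
# recovers beta and gamma from the model output shown above.
b_intercept   <- -2.581315   # (Intercept)
b_interaction <- -1.290240   # Interaction

exp(b_intercept)     # beta  = 0.0757, the latent (proposal) intensity
exp(b_interaction)   # gamma = 0.275, < 1 so inhibition
```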
You interpret the model output as follows. The process producing the spatial pattern of pine saplings is such that you should see .0757 saplings per unit area [unobserved (latent) rate].
But because of event inhibition, where saplings nearby other saplings fail to grow, the number of saplings is reduced to .0074 per unit area. Thus the spatial pattern is suggestive of sibling-sibling interaction. Adults have many offspring, but only some survive due to limited resources.
Thursday November 3, 2022
“Sometimes it pays to stay in bed on Monday, rather than spending the rest of the week debugging Monday’s code.” - Christopher Thompson
Today
- Fitting and interpreting a cluster model
- Assessing how well the model fits
- Spatial logistic regression
Fitting and interpreting a cluster model
Let’s compare the inhibition model fit previously to describe the Swedish pine saplings data with a cluster model for describing the Lansing Woods maple trees (in the ppp object called lansing from the {spatstat} package).
Start by extracting the events marked as maple and putting them in a separate ppp object called MT.
suppressMessages(library(spatstat))
data(lansing)
summary(lansing)## Marked planar point pattern: 2251 points
## Average intensity 2251 points per square unit (one unit = 924 feet)
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 924 feet)
##
## Multitype:
## frequency proportion intensity
## blackoak 135 0.05997335 135
## hickory 703 0.31230560 703
## maple 514 0.22834300 514
## misc 105 0.04664594 105
## redoak 346 0.15370950 346
## whiteoak 448 0.19902270 448
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 924 feet
MT <- lansing |>
subset(marks == "maple") |>
unmark()
summary(MT)## Planar point pattern: 514 points
## Average intensity 514 points per square unit (one unit = 924 feet)
##
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 924 feet)
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 924 feet
There are 514 maple trees over this square region (924 x 924 square feet).
Plots of the tree locations and the local intensity function help you understand the first-order property of these data.
MT |>
density() |>
plot()
plot(MT, add = TRUE)
There are maple trees across the southern and central parts of the study domain.
A plot of the \(G\) function summarizes the second-order properties under the assumption of no trend.
library(ggplot2)
G.df <- MT |>
Gest() |>
as.data.frame() |>
dplyr::filter(r < .033) |>
dplyr::mutate(r = r * 924)
ggplot(G.df, aes(x = r, y = km)) +
geom_line() +
geom_line(aes(y = theo), color = "blue") +
geom_vline(xintercept = 18, lty = 'dashed') +
xlab("Lag distance (ft)") + ylab("G(r): Cumulative % of events within a distance r of another maple") +
theme_minimal()
The plot provides evidence that the maple trees are clustered. The empirical curve is above the theoretical curve. For example, about 74% of the maple trees are within 18 feet of another maple tree (vertical dashed line). If the trees were arranged as CSR then only 49% of the trees would be within 18 feet of another maple.
Is the clustering due to interaction or trends (or both)?
You start the modeling process by investigating event interaction using a stationary Strauss model with interaction radius of .019 units (18 ft).
ppm(MT,
trend = ~ 1,
interaction = Strauss(r = .019))
## Stationary Strauss process
##
## First order term: beta = 344.625
##
## Interaction distance: 0.019
## Fitted interaction parameter gamma: 1.7253743
##
## Relevant coefficients:
## Interaction
## 0.545444
##
## For standard errors, type coef(summary(x))
##
## *** Model is not valid ***
## *** Interaction parameters are outside valid range ***
Here the first-order term beta is 345. It is the 'latent' rate (intensity) of maple trees per unit area, which is less than the 514 maples actually observed. The fitted interaction parameter gamma is 1.73. It is greater than one because the trees are clustered, and the logarithm of gamma is correspondingly positive (.545).
The model is interpreted as follows. The process producing the maple trees is such that you expect to see about 345 maples. Because of clustering where maple trees are more likely in the vicinity of other maple trees, the number of maples increases to the observed 514 per unit area.
Here the physical explanation could be event interaction. But it also could be explained by inhibition with hickory trees. You can model this using a term for cross event type interaction.
The Strauss process is a model for inhibition (gamma must be at most one), so although it is useful here as a diagnostic, you need to fit a cluster model instead (hence the *** Model is not valid *** warning).
For a cluster model the spatial intensity \[\lambda(s) = \kappa \mu(s)\] where \(\kappa\) is the average number of clusters and where \(\mu(s)\) is the spatial varying cluster size (number events per cluster).
Cluster models are fit using the kppm() function from the {spatstat} package. Here you specify the cluster process with clusters = "Thomas".
That means each cluster consists of a Poisson number of maple trees, with each tree in the cluster placed randomly about the 'parent' tree with a density that decays with distance from the parent as a Gaussian function.
( model.cl <- kppm(MT,
trend = ~ 1,
clusters = "Thomas") )
## Stationary cluster point process model
## Fitted to point pattern dataset 'MT'
## Fitted by minimum contrast
## Summary statistic: K-function
##
## Uniform intensity: 514
##
## Cluster model: Thomas process
## Fitted cluster parameters:
## kappa scale
## 21.74344366 0.06752959
## Mean cluster size: 23.63931 points
Here \(\kappa\) is 21.74 and \(\bar \mu(s)\) (mean cluster size) is 23.6 trees. The product of kappa and the mean cluster size is the number of events. The cluster model describes a parent-child process. The number of parents is about 22, and the distribution of the parents can be described as CSR. Each parent produces about 24 offspring distributed randomly about the location of the parent within a characteristic distance. Note: The physical process might be different from the statistical process used to describe it.
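A quick arithmetic check that the fitted cluster parameters reproduce the observed count: the average number of clusters times the mean cluster size should equal the number of maples.

```r
kappa <- 21.74344366   # fitted average number of clusters
mu_bar <- 23.63931     # fitted mean cluster size (trees per cluster)
kappa * mu_bar         # about 514, the observed number of maples
```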
The fitted scale parameter gives the characteristic size of the clusters in distance units; it is the standard deviation \(\sigma\) of the Gaussian displacement of offspring from their parent.
A plot() method verifies that the cluster process statistically ‘explains’ the spatial correlation.
plot(model.cl,
what = "statistic")
The model (black line) is very close to the cluster process line (red dashed line). Also note that it is far from the CSR model (green line).
The spatial scale of the clustering is visualized with the what = "cluster" argument.
plot(model.cl,
what = "cluster")
The color ramp is the spatial intensity (number of events per unit area) about an arbitrary single event revealing the spatial scale and extent of clustering.
Assessing how well the model fits
Workflow in fitting spatial event location models
- Analyze/plot the intensity and nearest neighbor statistics
- Select a model including trend, interaction distance, etc informed by the results of step 1
- Choose an inhibition or cluster model
- Fit the model to the event pattern
- Assess how well the model fits the data by generating samples and comparing statistics from the samples with the statistics from the original data
The model should be capable of generating samples of event locations that are statistically indistinguishable from the actual event locations.
Note: The development of spatial point process methods has largely been theory driven (not by actual problems/data). More work needs to be done to apply the theory to environmental data with spatial heterogeneity, properties at the individual level (marks), and with time information.
You produce samples of event locations with the simulate() function applied to the model object.
Let’s return to the Swedish pine sapling data and the inhibition model.
SP <- swedishpines
model.in <- ppm(SP,
trend = ~ 1,
interaction = Strauss(r = 10),
rbord = 10)
Here you generate three samples of the Swedish pine sapling data and plot them alongside the actual data for comparison.
X <- model.in |>
simulate(nsim = 3)
## Generating 3 simulated patterns ...1, 2, 3.
par(mfrow = c(2, 2))
plot(SP)
plot(X[[1]])
plot(X[[2]])
plot(X[[3]])
The samples of point pattern data look similar to the actual data providing evidence that the inhibition model is adequate.
To quantitatively assess the similarity use the envelope() function that computes the \(K\) function on 99 samples and the actual data. The \(K\) function values are averaged over all samples and a mean line represents the best model curve. Uncertainty is assessed with a band that ranges from the minimum to the maximum K at each distance.
Do this with the inhibition model for the pine saplings. This takes a few seconds to complete.
par(mfrow = c(1, 1))
plot(envelope(model.in,
fun = Kest,
nsim = 99,
correction = 'border'), legend = FALSE)
## Generating 99 simulated realisations of fitted Gibbs model ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.

The black line is the empirical (data) curve and the red line is the average over the 99 samples. The two lines are close and the black line falls nearly completely within the gray uncertainty band indicating the model fits the data well. The kink in the red curve is the result of specifying 10 units for the interaction distance.
From this plot you confidently conclude that a homogeneous inhibition model is adequate for describing the pine sapling data.
What about the model for the maple trees? The model is saved as model.cl.
plot(envelope(model.cl,
fun = Kest,
nsim = 99,
correction = 'border'), legend = FALSE)
## Generating 99 simulated realisations of fitted cluster model ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.

In the case of the maple trees, a cluster model is adequate. However, it is not satisfying since you know about the potential for inhibition caused by the presence of hickory trees.
Also you saw that there were more trees in the south than in the north so the stationary assumption is suspect.
You fit a second cluster model in which the log intensity is a linear function of the north-south coordinate.
model.cl2 <- kppm(MT,
trend = ~ y,
clusters = "Thomas")
model.cl2
## Inhomogeneous cluster point process model
## Fitted to point pattern dataset 'MT'
## Fitted by minimum contrast
## Summary statistic: inhomogeneous K-function
##
## Log intensity: ~y
##
## Fitted trend coefficients:
## (Intercept) y
## 6.894933 -1.486252
##
## Cluster model: Thomas process
## Fitted cluster parameters:
## kappa scale
## 26.955877 0.053585
## Mean cluster size: [pixel image]
This is an inhomogeneous cluster point process model. The logarithm of the intensity depends on y (Log intensity: ~y). The fitted trend coefficient is negative as expected, since there are fewer trees as you move north (increasing y direction). There is one spatial unit in the north-south direction so you interpret this coefficient to mean there are 77% fewer trees in the north than in the south. The 77% comes from the formula 1 - exp(-1.486) = .77.
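The 77% figure can be verified directly from the fitted trend coefficient:

```r
b_y <- -1.486252        # fitted coefficient on y (north-south coordinate)
exp(b_y)                # relative intensity at y = 1 vs y = 0, about 0.23
(1 - exp(b_y)) * 100    # about 77% fewer trees in the north
```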
The average number of clusters (kappa) is higher at about 27 (it was 22 for stationary model). The cluster scale parameter (sigma), indicating the characteristic size of the cluster (in distance units) is lower at .0536. That makes sense since some of the event-to-event distance is accounted for by the trend term.
Simulate data using the new model and compare the inhomogeneous \(K\) function between the simulations and the observed data.
plot(envelope(model.cl2,
fun = Kinhom,
nsim = 99,
correction = 'border'), legend = FALSE)
## Generating 99 simulated realisations of fitted cluster model ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
## 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99.
##
## Done.

The black line falls within the gray band, and the band is narrower than the one produced by the homogeneous cluster model.
Tropical trees
If the intensity of events depends on spatial location as it does with the maple trees you can include a trend and covariate term in the model.
For a trend term, the formula ~ x corresponds to a spatial trend of the form \(\lambda(x) = \exp(a + bx)\), while ~ x + y corresponds to \(\lambda(x, y) = \exp(a + bx + cy)\), where x and y are the spatial coordinates. For covariates, the formula is ~ covariate1 + covariate2.
Consider the bei data from the {spatstat} package containing the locations of 3604 trees in a tropical rain forest.
plot(bei)
The locations are accompanied by covariate data giving the elevation (altitude) and the slope of the elevation in the study region. The object bei.extra is a list containing two pixel images: elev (elevation in meters) and grad (norm of the elevation gradient). These pixel images are objects of class im (see im.object).
image(bei.extra)
Compute and plot the \(K\) function on the ppp object bei.
plot(envelope(bei,
fun = Kest,
nsim = 39,
global = TRUE,
correction = "border"),
legend = FALSE)
## Generating 39 simulations of CSR ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39.
##
## Done.

There is significant clustering indicated by the black line sitting far above the CSR line. There are more trees in the vicinity of other trees than expected by chance.
But how much of the clustering is due to variations in terrain?
You start by fitting a model that includes elevation and gradient as covariates without clustering. This is done with the trend = argument naming the image variables and including the argument covariates = indicating a data frame or, in this case, a list whose entries are image functions.
model1 <- ppm(bei,
trend = ~ elev + grad,
covariates = bei.extra)
Check to see if elevation and gradient as explanatory variables are significant in the model.
summary(model1)
## Point process model
## Fitting method: maximum likelihood (Berman-Turner approximation)
## Model was fitted using glm()
## Algorithm converged
## Call:
## ppm.ppp(Q = bei, trend = ~elev + grad, covariates = bei.extra)
## Edge correction: "border"
## [border correction distance r = 0 ]
## --------------------------------------------------------------------------------
## Quadrature scheme (Berman-Turner) = data + dummy + weights
##
## Data pattern:
## Planar point pattern: 3604 points
## Average intensity 0.00721 points per square metre
## Window: rectangle = [0, 1000] x [0, 500] metres
## Window area = 5e+05 square metres
## Unit of length: 1 metre
##
## Dummy quadrature points:
## 130 x 130 grid of dummy points, plus 4 corner points
## dummy spacing: 7.692308 x 3.846154 metres
##
## Original dummy parameters: =
## Planar point pattern: 16904 points
## Average intensity 0.0338 points per square metre
## Window: rectangle = [0, 1000] x [0, 500] metres
## Window area = 5e+05 square metres
## Unit of length: 1 metre
## Quadrature weights:
## (counting weights based on 130 x 130 array of rectangular tiles)
## All weights:
## range: [1.64, 29.6] total: 5e+05
## Weights on data points:
## range: [1.64, 14.8] total: 41000
## Weights on dummy points:
## range: [1.64, 29.6] total: 459000
## --------------------------------------------------------------------------------
## FITTED MODEL:
##
## Nonstationary Poisson process
##
## ---- Intensity: ----
##
## Log intensity: ~elev + grad
## Model depends on external covariates 'elev' and 'grad'
## Covariates provided:
## elev: im
## grad: im
##
## Fitted trend coefficients:
## (Intercept) elev grad
## -8.56355220 0.02143995 5.84646680
##
## Estimate S.E. CI95.lo CI95.hi Ztest Zval
## (Intercept) -8.56355220 0.341113849 -9.23212306 -7.89498134 *** -25.104675
## elev 0.02143995 0.002287866 0.01695581 0.02592408 *** 9.371155
## grad 5.84646680 0.255781018 5.34514522 6.34778838 *** 22.857313
##
## ----------- gory details -----
##
## Fitted regular parameters (theta):
## (Intercept) elev grad
## -8.56355220 0.02143995 5.84646680
##
## Fitted exp(theta):
## (Intercept) elev grad
## 1.909398e-04 1.021671e+00 3.460097e+02
The output shows that both elevation and elevation gradient are significant in explaining the spatial varying intensity of the trees.
Since the conditional intensity is on a log scale, you interpret the elevation coefficient as follows: for a one-meter increase in elevation, the local spatial intensity increases by a factor of exp(.0214) = 1.022, about a 2% increase.
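The same exponentiation gives the effect of larger elevation differences, for example per 10 m of elevation:

```r
b_elev <- 0.02143995          # fitted elevation coefficient (per meter)
(exp(b_elev) - 1) * 100       # about 2.2% intensity increase per meter
(exp(10 * b_elev) - 1) * 100  # about 24% increase per 10 meters
```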
Check how well the model fits the data. Again this is done with the envelope() function using the model object as the first argument.
E <- envelope(model1,
fun = Kest,
nsim = 39,
correction = "border",
global = TRUE)
## Generating 78 simulated realisations of fitted Poisson model (39 to estimate
## the mean and 39 to calculate envelopes) ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78.
##
## Done.
plot(E, main = "Inhomogeneous Poisson Model",
legend = FALSE)
You conclude that although elevation and elevation slope are significant in explaining the spatial distribution of trees, they do not explain all the clustering.
An improvement is made by adding a cluster process to the model. This is done with the function kppm().
model2 <- kppm(bei,
trend = ~ elev + grad,
covariates = bei.extra,
clusters = "Thomas")
E <- envelope(model2, Lest, nsim = 39,
global = TRUE,
correction = "border")
## Generating 78 simulated realisations of fitted cluster model (39 to estimate
## the mean and 39 to calculate envelopes) ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78.
##
## Done.
plot(E, main = "Clustered Inhomogeneous Model", legend = FALSE)
The uncertainty band is much wider. The empirical curve fits completely inside the band so you conclude that an inhomogeneous cluster process appears to be an adequate description of the point pattern data.
Violent tornadoes
The vast majority of tornadoes have winds of less than 60 m/s (about 134 mph). A violent tornado, with winds exceeding 90 m/s, is rare. Most of these potentially destructive and deadly tornadoes spawn from rotating thunderstorms called supercells, with formation contingent on local (storm-scale) meteorological conditions.
The long-term risk of a tornado at a given location is assessed using historical records; however, the rarity of the most violent tornadoes makes these rate estimates unstable. Here you use the more stable rate estimates from the larger set of less-violent tornadoes to create more reliable estimates of violent tornado frequency.
For this exercise attention is restricted to tornadoes occurring in Kansas over the period 1954–2020.
Torn.sf <- sf::st_read(dsn = here::here("data", "1950-2020-torn-initpoint")) |>
sf::st_transform(crs = 3082) |>
dplyr::filter(mag >= 0, yr >= 1954) |>
dplyr::mutate(EF = mag,
EFf = as.factor(EF)) |>
dplyr::select(yr, EF, EFf)
## Reading layer `1950-2020-torn-initpoint' from data source
## `/Users/jelsner/Desktop/ClassNotes/ASS-2022/data/1950-2020-torn-initpoint'
## using driver `ESRI Shapefile'
## Simple feature collection with 66244 features and 22 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -163.53 ymin: 17.7212 xmax: -64.7151 ymax: 61.02
## Geodetic CRS: WGS 84
W.sfc <- USAboundaries::us_states(states = "Kansas") |>
sf::st_transform(crs = sf::st_crs(Torn.sf)) |>
sf::st_geometry()
Torn.sf <- Torn.sf[W.sfc, ]
Create owin and ppp objects. Note that although you already subset by Kansas tornadoes above, you need to subset the ppp object to assign the Kansas boundary as the analysis window.
KS.win <- W.sfc |>
as.owin()
T.ppp <- Torn.sf["EF"] |>
as.ppp()
T.ppp <- T.ppp[KS.win]
summary(T.ppp)
## Marked planar point pattern: 4139 points
## Average intensity 1.918005e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## marks are numeric, of type 'double'
## Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.6139 1.0000 5.0000
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [1317675.9, 1980294.8] x [7114969, 7458570] units
## (662600 x 343600 units)
## Window area = 2.15797e+11 square units
## Fraction of frame area: 0.948
There are 4139 tornadoes over the period, with an average intensity of about 192 tornadoes per 10,000 square kilometers (multiply the per-square-meter intensity by 10^10, the number of square meters in a 100 km by 100 km square).
Separate the point pattern data into non-violent tornadoes and violent tornadoes. The non-violent tornadoes include those with an EF rating of 0, 1, 2 or 3. The violent tornadoes include those with an EF rating of 4 or 5.
NV.ppp <- T.ppp |>
subset(marks <= 3 & marks >= 0) |>
unmark()
summary(NV.ppp)
## Planar point pattern: 4098 points
## Average intensity 1.899006e-08 points per square unit
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [1317675.9, 1980294.8] x [7114969, 7458570] units
## (662600 x 343600 units)
## Window area = 2.15797e+11 square units
## Fraction of frame area: 0.948
V.ppp <- T.ppp |>
subset(marks >= 4) |>
unmark()
summary(V.ppp)
## Planar point pattern: 41 points
## Average intensity 1.899933e-10 points per square unit
##
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 units
##
## Window: polygonal boundary
## single connected closed polygon with 169 vertices
## enclosing rectangle: [1317675.9, 1980294.8] x [7114969, 7458570] units
## (662600 x 343600 units)
## Window area = 2.15797e+11 square units
## Fraction of frame area: 0.948
The spatial intensity of the non-violent tornadoes is about 190 per 10,000 square kilometers. The spatial intensity of the violent tornadoes is about 1.9 per 10,000 square kilometers.
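The unit conversion from the summary output, as a quick check (10^10 square meters is a 100 km by 100 km square):

```r
nv <- 1.899006e-08 * 1e10   # non-violent: about 190 per 10,000 sq km
v  <- 1.899933e-10 * 1e10   # violent: about 1.9 per 10,000 sq km
v / (nv + v)                # violent tornadoes are about 1% of all reports
```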
Plot the locations of the violent tornado events.
plot(V.ppp)
Earlier we found that the spatial intensity of tornado reports was a function of distance to the nearest city.
So here you include this as an explanatory variable. Import the data, set the CRS, and transform the CRS to match that of the tornadoes. Exclude cities with fewer than 1000 people.
C.sf <- USAboundaries::us_cities() |>
dplyr::filter(population >= 1000) |>
sf::st_transform(crs = sf::st_crs(Torn.sf))
## City populations for contemporary data come from the 2010 census.
Then convert the simple feature data frame to a ppp object. Then subset the events by the analysis window (Kansas border).
C.ppp <- C.sf |>
as.ppp()
## Warning in as.ppp.sf(C.sf): only first attribute column is used for marks
C.ppp <- C.ppp[KS.win] |>
unmark()
plot(C.ppp)
Next create a distance map of the city locations using the distmap() function.
Zc <- distmap(C.ppp)
plot(Zc)
The pixel values of the im object are distances in meters. Blue indicates locations that are less than 20 km from a city.
Interest lies with the distance to nearest non-violent tornado. You check to see if this might be a useful variable in a model so you make a distance map for the non-violent events and then use the rhohat() function.
Znv <- distmap(NV.ppp)
rhat <- rhohat(V.ppp, Znv,
adjust = 1.5,
smoother = "kernel",
method = "transform")
dist <- rhat$Znv
rho <- rhat$rho
hi <- rhat$hi
lo <- rhat$lo
Rho.df <- data.frame(dist = dist, rho = rho, hi = hi, lo = lo)
ggplot(Rho.df) +
geom_ribbon(aes(x = dist, ymin = lo, ymax = hi), alpha = .3) +
geom_line(aes(x = dist, y = rho), col = "black") +
ylab("Spatial intensity of violent tornadoes") + xlab("Distance from nearest non-violent tornado (m)") +
theme_minimal()
This shows that the spatial intensity of violent tornadoes is highest where the distance to the nearest non-violent tornado is small: regions that get non-violent tornadoes also see higher rates of violent tornadoes.
So the model should include two covariates (trend terms), distance to nearest city and distance to nearest non-violent tornado.
model1 <- ppm(V.ppp,
trend = ~ Zc + Znv,
covariates = list(Zc = Zc, Znv = Znv))
coef(summary(model1))
## Estimate S.E. CI95.lo CI95.hi Ztest
## (Intercept) -2.079665e+01 3.689920e-01 -2.151986e+01 -2.007344e+01 ***
## Zc -3.213231e-05 1.118327e-05 -5.405111e-05 -1.021350e-05 **
## Znv -2.235788e-04 8.585891e-05 -3.918592e-04 -5.529845e-05 **
## Zval
## (Intercept) -56.360705
## Zc -2.873248
## Znv -2.604026
As expected the model shows fewer violent tornadoes with increasing distance from the nearest city (negative coefficient on Zc) and fewer violent tornadoes with increasing distance from a non-violent tornado (negative coefficient on Znv).
Since the spatial unit is meters, the coefficient on Zc of -3.21e-05 is interpreted as a [1 - exp(-.0321)] * 100% = 3.2% decrease in violent tornado reports per kilometer of distance from a city. Similarly, the coefficient on distance from the nearest non-violent tornado is interpreted as a [1 - exp(-.224)] * 100% = 20% decrease in violent tornado reports per kilometer of distance from the nearest non-violent tornado.
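These percentages come from scaling the per-meter coefficients to kilometers and exponentiating:

```r
b_Zc  <- -3.213231e-05   # coefficient on distance to nearest city (per meter)
b_Znv <- -2.235788e-04   # coefficient on distance to nearest non-violent tornado (per meter)
(1 - exp(b_Zc * 1000)) * 100    # about 3% decrease per km from a city
(1 - exp(b_Znv * 1000)) * 100   # about 20% decrease per km from a non-violent tornado
```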
Check if there is any residual nearest neighbor correlation.
E <- envelope(model1,
fun = Kest,
nsim = 39,
global = TRUE)
## Generating 78 simulated realisations of fitted Poisson model (39 to estimate
## the mean and 39 to calculate envelopes) ...
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
## 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78.
##
## Done.
plot(E, main = "Inhomogeneous Poisson Model", legend = FALSE)
There appears to be a bit of regularity at smaller scales. The empirical curve (black line) falls slightly below the model (dashed red line). There are fewer nearby violent tornadoes than one would expect.
To see if this is statistically significant, you add an inhibition process to the model.
model2 <- ppm(V.ppp,
trend = ~ Zc + Znv,
covariates = list(Zc = Zc, Znv = Znv),
interaction = Strauss(r = 40000))
coef(summary(model2))
## Estimate S.E. CI95.lo CI95.hi Ztest
## (Intercept) -1.999626e+01 0.6389281922 -2.124853e+01 -1.874398e+01 ***
## Zc -4.125674e-05 0.0000129859 -6.670864e-05 -1.580484e-05 **
## Znv -2.325491e-04 0.0001163074 -4.605075e-04 -4.590757e-06 *
## Interaction -6.232454e-01 0.3926001130 -1.392727e+00 1.462367e-01
## Zval
## (Intercept) -31.296564
## Zc -3.177041
## Znv -1.999435
## Interaction -1.587481
The interaction coefficient has a negative sign as expected from the above plot, but the standard error is relatively large so it is not significant.
Remove the inhibition process and add a trend term in the east-west direction.
model3 <- ppm(V.ppp,
trend = ~ Zc + Znv + x,
covariates = list(Zc = Zc, Znv = Znv))
coef(summary(model3))
## Estimate S.E. CI95.lo CI95.hi Ztest
## (Intercept) -2.381531e+01 1.891801e+00 -2.752317e+01 -2.010745e+01 ***
## Zc -2.274246e-05 1.255697e-05 -4.735366e-05 1.868739e-06
## Znv -2.379710e-04 8.651254e-05 -4.075324e-04 -6.840952e-05 **
## x 1.681064e-06 1.020308e-06 -3.187026e-07 3.680830e-06
## Zval
## (Intercept) -12.588694
## Zc -1.811143
## Znv -2.750711
## x 1.647605
There is a significant eastward trend but it appears to confound the distance to city term. Why is this?
Plot simulated data.
plot(V.ppp)
plot(simulate(model1, nsim = 6))
## Generating 6 simulated patterns ...1, 2, 3, 4, 5, 6.

The first model (model1) appears to do a good job of simulating data that look like the actual data.
Spatial logistic regression
Spatial logistic regression is a popular model for point pattern data. The study domain is divided into a grid of cells; each cell is assigned the value one if it contains at least one event, and zero otherwise.
Then a logistic regression models the presence probability \(p = P(Y = 1)\) as a function of explanatory variables \(X\) in the (matrix) form \[ \log \frac{p}{1-p} = \beta X \] where the left-hand side is the logit (log of the odds ratio) and the \(\beta\) are the coefficients to be determined.
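The logit is inverted with the logistic function, which in R is plogis(). A minimal sketch with hypothetical coefficient and covariate values (not the fitted copper model below):

```r
# Inverting the logit: p = exp(eta) / (1 + exp(eta)), where eta is the linear predictor
beta <- c(-4.7, 0.08)            # hypothetical intercept and slope
X <- 5                           # hypothetical covariate value
eta <- beta[1] + beta[2] * X     # linear predictor (log odds)
p <- plogis(eta)                 # presence probability
all.equal(p, exp(eta) / (1 + exp(eta)))   # TRUE
```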
If your data are stored as ppp objects, a spatial logistic model can be fit directly using functions from the {spatstat} package.
Let’s consider an example from the package (a good strategy in general when learning a new technique).
Consider the locations of 57 copper ore deposits (events) and 146 line segments representing geological 'lineaments.' Lineaments are linear features that often correspond to geological faults.
It is of interest to be able to predict the probability of a copper ore from the lineament pattern. The data are stored as a list in copper. The list contains a ppp object for the ore deposits and a psp object for the lineaments.
data(copper)
plot(copper$SouthPoints)
plot(copper$SouthLines, add = TRUE)
For convenience you first rotate the events (points and lines) by 90 degrees in the anticlockwise direction and save them as separate objects.
C <- rotate(copper$SouthPoints, pi/2)
L <- rotate(copper$SouthLines, pi/2)
plot(C)
plot(L, add = TRUE)
You summarize the planar point pattern data object C.
summary(C)
## Planar point pattern: 57 points
## Average intensity 0.01020691 points per square km
##
## Coordinates are given to 2 decimal places
## i.e. rounded to the nearest multiple of 0.01 km
##
## Window: rectangle = [-158.233, -0.19] x [-0.335, 35] km
## (158 x 35.34 km)
## Window area = 5584.45 square km
## Unit of length: 1 km
There are 57 ore deposits over a region of size 5584 square km resulting in an intensity of about .01 ore deposits per square km.
Next you create a distance map of the lineaments to be used as a covariate.
D <- distmap(L)
plot(D)
Spatial logistic regression models are fit with the slrm() function from the {spatstat} package.
model.slr <- slrm(C ~ D)
model.slr
## Fitted spatial logistic regression model
## Formula: C ~ D
## Fitted coefficients:
## (Intercept) D
## -4.72337865 0.07811134
The model says that the odds of a copper ore deposit along a lineament (D = 0) are exp(-4.723) = .00888. This is slightly less than the overall intensity of .01.
The model also says that for every one unit (one kilometer) increase in distance from a lineament the expected change in the log odds is .0781 [exp(.0781) = 1.0812] or an 8.1% increase in the odds. Ore deposits are more likely between the lineaments.
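Both interpretations follow from exponentiating the fitted coefficients:

```r
b0 <- -4.72337865      # intercept: log odds on a lineament (D = 0)
b1 <- 0.07811134       # slope: change in log odds per km from a lineament
exp(b0)                # about .0089, the odds on a lineament
(exp(b1) - 1) * 100    # about 8.1% increase in the odds per km
```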
The fitted() method produces an image (raster) over the window giving the local probability of an ore deposit. The values are the probability of a random ore deposit in each pixel.
plot(fitted(model.slr))
plot(C, add = TRUE)
Integrating the predictions over the area equals the observed number of ore deposits.
sum(fitted(model.slr))
## [1] 57
Thursday November 10, 2022
“Beyond basic mathematical aptitude, the difference between good programmers and great programmers is verbal ability.” – Marissa Mayer
Today
- Spatial data interpolation
- Computing the sample (empirical) variogram
Spatial data interpolation
In situ observations of the natural world are made at specific locations in space (and time). But we often want estimates of the values everywhere. The temperature reported at the airport is 15C, but what is it at my house 10 miles away?
We need to interpolate values observed at certain locations to values anywhere over the domain. To do this we assume the observations are taken from a continuous field (surface). Data observed or measured at locations across a continuous field are called geostatistical data. Examples: concentrations of heavy metals across a farm field, surface air pressures in cities across the country, air temperatures within a city during the night.
Local averaging, spline functions, and inverse-distance weighting are interpolation methods. If it is 20C five miles north of here and 30C five miles to the south, then it is 25C here. This type of interpolation is a reasonable first-order assumption. But these interpolation methods do not take into account spatial autocorrelation, and they do not estimate uncertainty about the interpolated values.
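As a minimal sketch (a hypothetical helper, not part of the kriging workflow that follows), inverse-distance weighting computes a weighted average with weights that decay with distance; with two equidistant stations it reduces to the simple average in the example above:

```r
# Minimal inverse-distance-weighting sketch
idw <- function(obs, d, p = 2) {
  w <- 1 / d^p              # weights decay with distance
  sum(w * obs) / sum(w)     # weighted average of the observations
}
idw(obs = c(20, 30), d = c(5, 5))   # equidistant stations: 25
```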
Kriging is statistical spatial interpolation. It is the centerpiece of what is called ‘geostatistics.’ The resulting surface (kriged surface) has three parts. (1) Spatial trend: an increase or decrease in the values that depends on direction or a covariate (co-kriging); (2) Local spatial autocorrelation. (3) Random variation. This should now sound familiar. Together the three components provide a model that is used to estimate values everywhere within a specified domain.
In short, geostatistics is used to quantify spatial correlation, predict values at locations where values were not observed, estimate uncertainty on the predicted values, and simulate data.
As we’ve done with areal data and point pattern data (Moran’s I, Ripley’s K), we begin with quantifying spatial autocorrelation. To get started we need some definitions.
- Statistical interpolation assumes the observed values are spatially homogeneous. This implies stationarity and continuity
- Stationarity means that the average difference in values between pairs of observations separated by a given distance (lag) is constant across the domain
- Continuity means that the spatial autocorrelation depends only on the lag (and orientation) between observations. That is, spatial autocorrelation is independent of location and can be described by a single function
- Stationarity and continuity allow different parts of the region to be treated as “independent” samples
Stationarity can be weak or intrinsic. Both assume the average of the difference in values at observations separated by a lag distance \(h\) is zero. That is, E\([z_i - z_j]\) = 0, where location \(i\) and location \(j\) are a (lag) distance \(h\) apart. This implies that the interpolated surface \(Z(s)\) is a random function with a constant mean (\(m\)) and a residual (\(\varepsilon\)).
\[ Z(s) = m + \varepsilon(s). \] The expected value (average across all values) in the domain is \(m\).
Weak stationarity assumes that the covariance is a function of lag distance \(h\).
\[ \hbox{cov}(z_i, z_j) = \hbox{cov}(h) \] where cov(\(h\)) is called the covariogram.
Intrinsic stationarity assumes the variance of the difference in values is a function of the lag distance.
\[ \hbox{var}(z_i - z_j) = \gamma(h), \] where \(\gamma(h)\) is called the variogram. This means that spatial autocorrelation is independent of location.
These assumptions are needed to get us started with statistical interpolation. If the assumptions are not met, we remove the trends in the data before spatially interpolating the residuals.
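As a sketch of that detrending step (using made-up coordinates and a fabricated north-south trend, not data from this lesson), fit a regression on the coordinates and keep the residuals:

```r
set.seed(1)
# made-up locations and values with a north-south (y) trend plus noise
d <- data.frame(x = runif(30), y = runif(30))
d$z <- 10 + 5 * d$y + rnorm(30)
# fit a first-order trend surface and extract the residuals
trend <- lm(z ~ x + y, data = d)
d$zres <- residuals(trend)
# the residuals are what get interpolated; the trend is added back afterward
```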
Computing the covariogram and the correlogram
In practice we focus on a model for the variogram \(\gamma(h)\). But to understand the variogram it helps to first consider the covariogram. This is because of our familiarity with the idea of nearby things being more correlated than things farther away.
To keep things simple without loss of generality, we start with a 4 x 6 grid of equally spaced surface air temperatures across a field in degrees C.
21 21 20 19 18 19
26 25 26 27 29 28
32 33 34 35 30 28
34 35 35 36 32 31
Put the values into a data vector and determine the mean and variance.
temps <- c(21, 21, 20, 19, 18, 19,
26, 25, 26, 27, 29, 28,
32, 33, 34, 35, 30, 28,
34, 35, 35, 36, 32, 31)
mean(temps)
## [1] 28.08333
var(temps)
## [1] 34.60145
To start, you focus only on the covariance function in the north-south direction. To compute the sample covariance function you first compute the covariance between the observed values one distance unit apart. In mathematical notation,
\[ \hbox{cov}(0, 1) = 1/|N(1)| \sum (z_i - Z)(z_j - Z) \] where \(|N(1)|\) is the number of distinct observation pairs with a distance separation of one unit in the north-south direction and where \(Z\) is the average over all observations. We let zero in cov(0, 1) refer to the direction and we let one refer to the distance one unit apart. With this grid of observations \(|N(1)|\) = 18.
The equation for the covariance can be simplified to
\[ \hbox{cov}(0, 1) = 1/|N(1)| \sum z_i z_j - m_{-1} m_{+1} \] where \(m_{-1}\) is the average temperature over all rows except the first (northern most) and \(m_{+1}\) is the average temperature over all rows except the last (southern most).
To simplify the notation re-index the grid of temperatures using lexicographic (reading) order.
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24
Then
mp1 <- mean(temps[1:18])
mm1 <- mean(temps[7:24])
cc <- sum(temps[1:18] * temps[7:24])/18
cc - mm1 * mp1
## [1] 15.01852
Or more generally
N <- 18
k <- 1:N
1/N * sum(temps[k] * temps[k + 6]) - mean(temps[k]) * mean(temps[k + 6])
## [1] 15.01852
The covariance has units of the observed variable squared (here \(^\circ C^2\)).
You also have observation pairs two units of distance apart. So you compute the cov(0, 2) in a similar way. \[ \hbox{cov}(0, 2) = 1/|N(2)| \sum z_i z_j - m_{-2} m_{+2} \] where \(m_{-2}\) is the average temperature over all rows except the first two and \(m_{+2}\) is the average temperature over all rows except the last two. \(|N(2)|\) is the number of pairs two units apart.
N <- 12
k <- 1:N
1/N * sum(temps[k] * temps[k + 12]) - mean(temps[k]) * mean(temps[k + 12])
## [1] 2.9375
Similarly you have observation pairs three units apart so you compute cov(0, 3) as \[ \hbox{cov}(0, 3) = 1/|N(3)| \sum z_i z_j - m_{-3} m_{+3} \]
N <- 6
k <- 1:N
1/N * sum(temps[k] * temps[k + 18]) - mean(temps[k]) * mean(temps[k + 18])
## [1] 0.9444444
There are no observation pairs four units apart in the north-south direction so you are finished. The covariogram is a plot of the covariance values as a function of lag distance. Let \(h\) be the lag distance, then
| \(h\) | cov(\(h\)) |
|---|---|
| (0, 1) | 15 |
| (0, 2) | 3 |
| (0, 3) | 1 |
It is convenient to have a dimensionless measure of co-variability. So you divide the covariance at lag distance \(h\) by the covariance at lag zero (the variance). This is the correlogram. Values of the correlogram range from -1 to +1.
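Continuing with the temperature grid, the correlogram values follow by dividing each covariance by the lag-zero covariance, computed here with the same 1/N estimator used above:

```r
temps <- c(21, 21, 20, 19, 18, 19,
           26, 25, 26, 27, 29, 28,
           32, 33, 34, 35, 30, 28,
           34, 35, 35, 36, 32, 31)
# covariance at lag zero using the 1/N estimator (population variance)
cov0 <- mean(temps^2) - mean(temps)^2
# the covariances cov(0, 1), cov(0, 2), cov(0, 3) computed above
covh <- c(15.01852, 2.9375, 0.9444444)
round(covh / cov0, 2)
## [1] 0.45 0.09 0.03
```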
The covariogram is a decreasing function of lag distance. The variogram is its mirror image: under stationarity, \(\gamma(h) = \hbox{cov}(0) - \hbox{cov}(h)\), so the variogram increases where the covariogram decreases.
Mathematically: var(\(z_i - z_j\)) for locations \(i\) and \(j\). The semivariogram is 1/2 the variogram. If location \(i\) is near location \(j\), the difference in the values will be small and so too will the variance of their differences, in general. If location \(i\) is far from location \(j\), the difference in values will be large and so too will the variance of their differences.
In practice you have a set of observations and we compute a variogram. This is the sample (empirical) variogram. Let \(t_i = (x_i, y_i)\) be the ith location and \(h_{i,j} = t_j - t_i\) be the vector connecting location \(t_i\) with location \(t_j\). Then the sample variogram is defined as
\[ \gamma(h) = \frac{1}{2N(h)} \sum^{N(h)} (z_i - z_j)^2 \] where \(N(h)\) is the number of observation pairs a distance of \(h\) units apart.
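As a check on this formula, the sample variogram at lag one in the north-south direction of the temperature grid from earlier can be computed directly (18 pairs, re-indexed in reading order):

```r
temps <- c(21, 21, 20, 19, 18, 19,
           26, 25, 26, 27, 29, 28,
           32, 33, 34, 35, 30, 28,
           34, 35, 35, 36, 32, 31)
N <- 18  # number of north-south pairs one unit apart
k <- 1:N
# gamma(1): half the average squared difference over the 18 pairs
sum((temps[k] - temps[k + 6])^2) / (2 * N)
## [1] 16.52778
```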
The variogram assumes intrinsic stationarity so the raw observed values should not have a trend. If there is a trend it needs to be removed before computing the variogram.
The sample variogram is characterized by a set of points the values of which generally increase as \(h\) increases before leveling off (reaching a plateau).
As an example, you compute and plot the sample variogram from the meuse.all data frame from the {gstat} package. First attach the data frame and look at the first six rows.
library(gstat)
##
## Attaching package: 'gstat'
## The following object is masked from 'package:spatstat.core':
##
## idw
data(meuse.all)
head(meuse.all)
## sample x y cadmium copper lead zinc elev dist.m om ffreq soil
## 1 1 181072 333611 11.7 85 299 1022 7.909 50 13.6 1 1
## 2 2 181025 333558 8.6 81 277 1141 6.983 30 14.0 1 1
## 3 3 181165 333537 6.5 68 199 640 7.800 150 13.0 1 1
## 4 4 181298 333484 2.6 81 116 257 7.655 270 8.0 1 2
## 5 5 181307 333330 2.8 48 117 269 7.480 380 8.7 1 2
## 6 6 181390 333260 3.0 61 137 281 7.791 470 7.8 1 2
## lime landuse in.pit in.meuse155 in.BMcD
## 1 1 Ah FALSE TRUE FALSE
## 2 1 Ah FALSE TRUE FALSE
## 3 1 Ah FALSE TRUE FALSE
## 4 0 Ga FALSE TRUE FALSE
## 5 0 Ah FALSE TRUE FALSE
## 6 0 Ga FALSE TRUE FALSE
The data are locations and top soil heavy metal concentrations (ppm), along with a number of soil and landscape variables, collected in a flood plain of the river Meuse, near the village Stein, NL. Heavy metal concentrations are bulk sampled from an area of approximately 15 m x 15 m.
Next locate where the data are from. First convert the data frame to a spatial data frame and then use functions from the {tmap} package in view mode.
meuse.sf <- meuse.all |>
sf::st_as_sf(coords = c("x", "y"),
crs = 28992)
tmap::tmap_mode("view")
## tmap mode set to interactive viewing
tmap::tm_shape(meuse.sf) +
tmap::tm_bubbles(size = "zinc")
## Legend for symbol sizes not available in view mode.
Then compute the sample variogram and save it as meuse.v.
meuse.v <- variogram(zinc ~ 1,
data = meuse.all,
locations = ~ x + y)
class(meuse.v)
## [1] "gstatVariogram" "data.frame"
The output is an object of class gstatVariogram and data.frame. Plot the sample variogram and label the key features.
library(ggplot2)
ggplot(data = meuse.v,
mapping = aes(x = dist, y = gamma)) +
geom_point(size = 2) +
scale_y_continuous(limits = c(0, 210000)) +
geom_hline(yintercept = c(30000, 175000), color = "red") +
geom_vline(xintercept = 800, color = "red") +
xlab("Lag distance (h)") + ylab(expression(paste(gamma,"(h)"))) +
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 30000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 100, y = 10000, label = "nugget")) +
geom_segment(aes(x = 0, y = 10000, xend = 0, yend = 175000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 180, y = 150000, label = "sill (partial sill)")) +
geom_segment(aes(x = 0, y = 190000, xend = 800, yend = 190000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 250, y = 190000, label = "range")) +
theme_minimal()
- Lag (lag distance): Relative distance between observation locations (here units: meters)
- Nugget (nugget, nugget variance, or nugget effect): The height of the variogram at zero lag (here units ppm squared). The nugget is the variation in the values at the observation locations independent of spatial variation. It is related to the observation (or measurement) precision
- Sill: The height of the variogram at which the values are uncorrelated
- Relative nugget effect: The ratio of the nugget to the sill expressed as a percentage
- Range: The distance beyond which the values are uncorrelated. The range is indicated on the empirical variogram as the position along the horizontal axis where values of the variogram reach a constant height
Additional terms.
- Isotropy: The condition in which spatial correlation is the same in all directions
- Anisotropy: (an-I-so-trop-y) spatial correlation is stronger or more persistent in some directions
- Directional variogram: Distance and direction are important in characterizing the spatial correlations. Otherwise the variogram is called omni-directional
- Azimuth (\(\theta\)): Defines the direction of the variogram in degrees. The azimuth is measured clockwise from north
- Lag spacing: The distance between successive lags is called the lag spacing or lag increment
- Lag tolerance: The distance allowable for observational pairs at a specified lag. With arbitrary observation locations there will be no observations exactly a lag distance from any observation. Lag tolerance provides a range of distances to be used for computing values of the variogram at a specified lag
Computing the sample variogram is the first step in modeling geostatistical data. The next step is fitting a model to the variogram. The model is important since the sample variogram estimates are made only at discrete lag distances (with specified lag tolerance and azimuth). You need a continuous function that varies smoothly across all lags. In short, the statistical model replaces the discrete set of points.
Variogram models come from different families. The fitting process first requires a decision about what family to choose and then given the family, a decision about what parameters (nugget, sill, range) to choose.
An exponential variogram model reaches the sill asymptotically. The range (a) is defined as the lag distance at which gamma reaches 95% of the sill.
c0 <- .1
c1 <- 2.1
a <- 1.3
curve(c0 + c1*(1 - exp(-3*x/a)),
from = .01, to = 3,
xlab = "h",
ylab = expression(paste(hat(gamma), "(h)")),
las = 1)
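The 95% figure can be verified directly: at \(h = a\) the exponential model has climbed \(1 - e^{-3}\) of the partial sill above the nugget.

```r
# fraction of the partial sill reached by the exponential model at lag h = a
1 - exp(-3)
## [1] 0.9502129
```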
A spherical variogram model reaches the sill at x = 1 (here the range a is fixed at one).
curve(c0 + c1*(3*x/2 - x^3/2),
from = .01, to = 1,
xlab = "h",
ylab = expression(paste(hat(gamma), "(h)")),
las = 1)
A Gaussian variogram model is “S”-shaped (sigmoidal). It is used when the data exhibit strong correlations at the shortest lag distances. The inflection point of the model occurs at \(h = a/\sqrt{6}\).
curve(c0 + c1*(1 - exp(-3*x^2/a^2)),
from = .01, to = 3,
xlab = "h",
ylab = expression(paste(hat(gamma), "(h)")),
las = 1)
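A quick numeric check of the inflection point, using the a = 1.3 assumed above: the discrete second derivative of the Gaussian model changes sign near \(a/\sqrt{6}\).

```r
a <- 1.3
g <- function(h) 1 - exp(-3 * h^2 / a^2)
h <- seq(0.01, 1.5, by = 0.001)
d2 <- diff(diff(g(h)))             # discrete second derivative
h[which(diff(sign(d2)) != 0) + 1]  # location of the sign change
a / sqrt(6)                        # analytic inflection point, about 0.531
```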
Other families include
- Linear models: \(\hat \gamma(h)\) = c0 + b * h.
- Power models: \(\hat \gamma(h)\) = c0 + b * h\(^\lambda\).
These models have no sill.
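For comparison with the curves above, the linear and power models can be sketched the same way (the values c0 = .1, b = 1, and \(\lambda\) = .5 here are arbitrary choices for illustration):

```r
c0 <- .1
b <- 1
lambda <- .5
# linear model: no sill, constant slope
curve(c0 + b * x,
      from = .01, to = 3,
      xlab = "h",
      ylab = expression(paste(hat(gamma), "(h)")),
      las = 1)
# power model: no sill, slope decreasing with lag
curve(c0 + b * x^lambda,
      add = TRUE, lty = 2)
```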
Choosing a variogram family is largely done by looking at the shape of the sample variogram. Then, given a sample variogram computed from a set of spatial observations and a choice of family, the parameters of the variogram model are determined by weighted least squares (WLS). Weighting is needed because the sample variogram estimates are computed using a varying number of point pairs at each lag.
There are other ways to determine the parameters, including fitting by eye and the method of maximum likelihood, but WLS is less erratic than other methods and requires fewer assumptions about the distribution of the data. The process can be automated, and it often is in high-level packages, but it is important to understand what is in the black box.
The final step in spatial statistical interpolation is called kriging. Kriging interpolates the observed data using the variogram model. It was developed by the South African mining engineer D. G. Krige as a way to improve estimates of where ore reserves might be located. Extraction costs are reduced substantially if good predictions can be made of where the ore resides given samples taken across the mine.
A kriged estimate is a weighted average of the observations where the weights are based on the variogram model. The kriged estimates are optimal in the sense that they minimize the error variance. The type of kriging depends on the characteristics of the observations and the purpose of interpolation.
- Simple kriging assumes a known constant mean for the domain
- Ordinary kriging assumes an unknown constant mean
- Universal kriging assumes an unknown linear or nonlinear trend in the mean
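To make the weighted-average idea concrete, here is a toy ordinary kriging calculation with three observations on a line and an assumed exponential covariance function (none of these numbers come from the lesson data):

```r
s <- c(0, 1, 3)        # observation locations (one dimension)
z <- c(2.0, 2.5, 3.5)  # observed values
s0 <- 2                # prediction location
covfun <- function(h) exp(-h)       # assumed covariance model (sill 1, no nugget)
C <- covfun(abs(outer(s, s, "-")))  # covariances between observations
cvec <- covfun(abs(s - s0))         # covariances to the prediction point
# ordinary kriging system with the unbiasedness constraint sum(w) = 1
A <- rbind(cbind(C, 1), c(1, 1, 1, 0))
b <- c(cvec, 1)
w <- solve(A, b)[1:3]  # kriging weights (the 4th unknown is the Lagrange multiplier)
sum(w * z)             # kriged estimate at s0
```

The weights sum to one, so the estimate is a genuine weighted average of the observations, with nearby observations weighted more heavily.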
To review, the steps for spatial interpolation (statistical) are:
- Examine the observations for trends and isotropy
- Compute a sample (empirical) variogram
- Fit a variogram model to the empirical variogram
- Create an interpolated surface using the variogram model together with the data (kriging)
Computing the sample variogram
The {gstat} package contains functions for spatial interpolation that take advantage of simple feature (and S4 class) spatial data frames.
Suppose we have the following set of observations (zobs) at locations (sx, sy).
sx <- c(1.1, 3.2, 2.1, 4.9, 5.5, 7, 7.8, 9, 2.3, 6.9)
sy <- c(3, 3.5, 6, 1.5, 5.5, 3.2, 1, 4.5, 1, 7)
zobs <- c(-0.6117, -2.4232, -0.42, -0.2522, -2.0362, 0.9814, 1.842,
0.1723, -0.0811, -0.3896)
Create a data frame and plot the observed values at the locations using the geom_text() function.
zobs.sf <- data.frame(sx, sy, zobs) |>
sf::st_as_sf(coords = c("sx", "sy"),
remove = FALSE,  # keep sx and sy as columns so they can be mapped in aes()
crs = 4326)
ggplot(data = zobs.sf,
mapping = aes(x = sx, y = sy, label = zobs)) +
geom_text() +
theme_minimal()
Lag distance (distance between locations) is the independent variable in the variogram function. You get all pairwise distances by applying the dist() function to a matrix of spatial coordinates.
dist(cbind(sx, sy))
## 1 2 3 4 5 6 7 8
## 2 2.158703
## 3 3.162278 2.731300
## 4 4.085340 2.624881 5.300000
## 5 5.060632 3.047950 3.436568 4.044750
## 6 5.903389 3.811824 5.643580 2.701851 2.745906
## 7 6.992138 5.235456 7.582216 2.942788 5.053712 2.340940
## 8 8.041144 5.885576 7.061161 5.080354 3.640055 2.385372 3.700000
## 9 2.332381 2.657066 5.003998 2.647640 5.521775 5.189412 5.500000 7.559100
## 10 7.045566 5.093133 4.903060 5.852350 2.051828 3.801316 6.067125 3.264966
## 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10 7.560423
max(dist(cbind(sx, sy)))
## [1] 8.041144
min(dist(cbind(sx, sy)))
## [1] 2.051828
The dist() function computes a pairwise distance matrix. The distance between the first and second observation is 2.16 units and so on. The largest lag distance is 8.04 units and the smallest lag distance is 2.05 units.
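These pairwise distances are what get grouped into lag bins when a variogram is computed. A sketch of the pair counts per bin, assuming a lag spacing of one unit:

```r
sx <- c(1.1, 3.2, 2.1, 4.9, 5.5, 7, 7.8, 9, 2.3, 6.9)
sy <- c(3, 3.5, 6, 1.5, 5.5, 3.2, 1, 4.5, 1, 7)
d <- dist(cbind(sx, sy))
# 10 locations give choose(10, 2) = 45 distinct pairs
length(d)
## [1] 45
# number of pairs falling in each one-unit lag bin
table(cut(as.numeric(d), breaks = 0:9))
```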
The functions in the {gstat} package work with simple feature objects.
As another example, consider the dataset called topo from the {MASS} package. The data are topographic heights (feet) within a 310 ft by 310 ft square domain.
Examine the data with a series of plots.
topo.df <- MASS::topo
p1 <- ggplot(data = topo.df,
mapping = aes(x = x, y = y, color = z)) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
p2 <- ggplot(data = topo.df,
mapping = aes(x = z, y = y)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal()
p3 <- ggplot(data = topo.df,
mapping = aes(x = x, y = z)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal()
p4 <- ggplot(data = topo.df,
mapping = aes(x = z)) +
geom_histogram(bins = 13) +
theme_minimal()
library(patchwork)
( p1 + p2 ) / ( p3 + p4 )
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Note the trend in the north-south direction and the skewness in the observed values.
Examine the residuals after removing a first-order trend from the observations.
topo.df$z1 <- residuals(lm(z ~ x + y, data = topo.df))
p1 <- ggplot(data = topo.df,
mapping = aes(x = x, y = y, color = z1)) +
geom_point() +
scale_color_viridis_c() +
theme_minimal()
p2 <- ggplot(data = topo.df,
mapping = aes(x = z1, y = y)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal()
p3 <- ggplot(data = topo.df,
mapping = aes(x = x, y = z1)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme_minimal()
p4 <- ggplot(data = topo.df,
mapping = aes(x = z1)) +
geom_histogram(bins = 13) +
theme_minimal()
( p1 + p2 ) / ( p3 + p4 )
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

The north-south trend is removed and the observations have a more symmetric distribution. There appears to be some non-linear trend (U-shape) in the east-west direction.
However, the residuals appear to show spatial autocorrelation (areas with clusters of above- and below-average residuals).
Compare the empirical variograms using first the raw values and then the residuals after removing the first-order trend.
topo.sf <- topo.df |>
sf::st_as_sf(coords = c("x", "y"))
topo.v1 <- variogram(z ~ 1,
data = topo.sf,
cutoff = 2.5)
topo.v2 <- variogram(z1 ~ 1,
data = topo.sf,
cutoff = 2.5)
ggplot(data = topo.v1,
mapping = aes(x = dist, y = gamma)) +
geom_point(color = "red") +
geom_point(data = topo.v2,
mapping = aes(x = dist, y = gamma),
color = "black") +
scale_x_continuous(breaks = seq(0, 2.5, by = .25)) +
xlab("Lag distance (h)") +
ylab(expression(paste(gamma,"(h)"))) +
theme_minimal()
The semivariance \(\gamma(h)\) is plotted against lag distance. Values increase with increasing lag until a lag distance of about 2.
At large lags there are fewer observation pairs so the estimates have greater variance. A model for the semivariance is fit only over the increasing portion of the graph.
The variogram values have units of square feet and are calculated using point pairs at lag distances within a lag tolerance. The number of point pairs depends on the lag so the variogram values are less precise at large distance.
Plot the number of point pairs used as a function of lag distance.
ggplot(data = topo.v2,
mapping = aes(y = np, x = dist)) +
geom_point() +
xlab("Lag Distance") + ylab("Number of Observation Pairs") +
theme_minimal()
Tuesday November 15, 2022
“Statistics is such a powerful language for describing data in ways that reveal nothing about their causes. Of course statistics is powerful for revealing causes as well. But it takes some care. Like the difference between talking and making sense.” - Richard McElreath
Today
- Fitting a variogram model to the sample variogram
- Creating an interpolated surface with the method of kriging
Fitting a variogram model to the sample variogram
Some years ago there were three nuclear waste repository sites being proposed in Nevada, Texas, and Washington. The proposed site needed to be large enough for more than 68,000 high-level waste containers placed underground, about 9 m (~30 feet) apart, in trenches surrounded by salt.
In July of 2002 the Congress approved Yucca Mountain, Nevada, as the nation’s first long-term geological repository for spent nuclear fuel and high-level radioactive waste.
The site must isolate the waste for 10,000 years. Leaks could occur, however, or radioactive heat could cause tiny quantities of water in the salt to migrate toward the heat until eventually each canister is surrounded by 22.5 liters of water (~6 gallons). A chemical reaction of salt and water can create hydrochloric acid that might corrode the canisters. The piezometric-head data at the site were obtained by drilling a narrow pipe into the aquifer and letting water seek its own level in the pipe (piezometer).
The head measurements, given in units of feet above sea level, are from drill stem tests and indicate the total energy of the water in units of height. The higher the head height, the greater the potential energy. Water flows away from areas of high potential energy with aquifer discharge proportional to the gradient of the piezometric head. The data are in wolfcamp.csv on my website.
Examine the observed data for trends and check to see if the observations are adequately described by a normal distribution.
Import the data as a data frame from the csv file.
L <- "http://myweb.fsu.edu/jelsner/temp/data/wolfcamp.csv"
wca.df <- readr::read_csv(L, show_col_types = FALSE)
Create a simple feature data frame and make a map showing the locations and the values for the head heights.
wca.sf <- sf::st_as_sf(x = wca.df,
coords = c("lon", "lat"),
crs = 4326)
tmap::tmap_mode("view")
## tmap mode set to interactive viewing
tmap::tm_shape(wca.sf) +
tmap::tm_dots("head")
You will use the spatial coordinates to model the spatial autocorrelation and to remove any spatial trends. So you include them as attributes in your spatial data frame.
XY <- wca.sf |>
sf::st_coordinates()
wca.sf$X <- XY[, 1]
wca.sf$Y <- XY[, 2]
Are all observations at different locations? Duplicate coordinates might be due to an error or they might represent multiple measurements at a location.
You check for duplicates with the {base} duplicated() function applied to the geometry field.
wca.sf$geometry |>
duplicated()
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE
Observation 31 is a location that already has an observed head height.
You remove this observation from the data frame.
wca.sf <- wca.sf |>
dplyr::filter(!duplicated(geometry))
wca.sf$geometry |>
duplicated() |>
any()
## [1] FALSE
Summarize the information in the spatial data frame.
wca.sf |>
summary()
## head geometry X Y
## Min. :1024 POINT :84 Min. :-104.5 Min. :33.51
## 1st Qu.:1543 epsg:4326 : 0 1st Qu.:-102.4 1st Qu.:33.87
## Median :1787 +proj=long...: 0 Median :-101.7 Median :34.26
## Mean :1998 Mean :-101.7 Mean :34.55
## 3rd Qu.:2541 3rd Qu.:-100.8 3rd Qu.:35.31
## Max. :3571 Max. :-100.0 Max. :36.09
wca.sf |>
sf::st_bbox()
## xmin ymin xmax ymax
## -104.55 33.51 -100.02 36.09
There are 84 well sites bounded between longitude lines 104.55W and 100.02W and latitude lines 33.51N and 36.09N.
The data values are summarized. The values are piezometric head heights in units of feet.
library(ggplot2)
ggplot() +
geom_sf(data = wca.sf,
mapping = aes(color = head)) +
scale_color_viridis_c() +
labs(col = "Height (ft)") +
theme_minimal()
There is a clear trend in head heights with the highest potential energy (highest heights) over the southwest (yellow) and lowest over the northeast (blue).
Compute and plot the empirical variogram using the variogram() function from the {gstat} package.
library(gstat)
variogram(head ~ 1,
data = wca.sf) |>
plot()
The continuously increasing set of variances with little fluctuation results from the data trend you saw in the map. There are at least two sources of variation in any set of geostatistical data: trend and spatial autocorrelation. Trend is modeled with a smooth curve and autocorrelation is modeled with the variogram.
Note: since the spatial coordinates are unprojected (decimal latitude/longitude), great-circle distances are used and the lag units are kilometers.
You compute and plot the variogram this time with the trend removed. You replace the 1 with X + Y on the right hand side of the formula. The variogram is then computed on the residuals from the linear trend model.
variogram(head ~ X + Y,
data = wca.sf) |>
plot()
Here you see an increase in the variance with lag distance out to about 100 km; beyond that the values fluctuate around a variance of roughly 41,000 ft\(^2\).
You save the variogram object computed on the residuals.
wca.v <- variogram(head ~ X + Y,
data = wca.sf)
You then use the information contained in the variogram object to anticipate the type of variogram model.
df <- wca.v |>
as.data.frame()
( p <- ggplot(data = df,
mapping = aes(x = dist, y = gamma)) +
geom_point() +
geom_smooth() +
scale_y_continuous(limits = c(0, NA)) +
ylab(expression(paste("Variogram [", gamma,"(h)]"))) +
xlab("Lag distance (h)") +
theme_minimal() )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The blue line is a local regression (loess) smoother through the variogram estimates. The fact that it is not a flat horizontal line indicates spatial autocorrelation in the residuals (distinct from the first-order trend).
Directional variograms
The assumption of isotropy implies the same spatial autocorrelation function in all directions.
You compute variograms using observational pairs located along the same orientation to examine this assumption. Instead of considering all observational pairs within a lag distance \(h\) and lag tolerance \(\delta h\), you consider only pairs within a directional segment.
This is done with the alpha = argument specifying the direction in plane (x,y), in positive degrees clockwise from positive y (North): alpha = 0 for direction North (increasing y), alpha = 90 for direction East (increasing x).
Here you specify four directions (north-south 0, northeast-southwest 45, east-west 90, and southeast-northwest 135) and compute the corresponding directional variograms.
wca.vd <- variogram(head ~ X + Y,
data = wca.sf,
alpha = c(0, 45, 90, 135))
df <- wca.vd |>
as.data.frame() |>
dplyr::mutate(direction = factor(dir.hor))
ggplot(data = df,
mapping = aes(x = dist, y = gamma, color = direction)) +
geom_point() +
geom_smooth(alpha = .2) +
scale_y_continuous(limits = c(0, NA)) +
ylab(expression(paste("Variogram [", gamma,"(h)]"))) +
xlab("Lag distance (h)") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The four variograms all have a similar shape and there is large overlap in the uncertainty bands surrounding the smooth curves so you conclude that the assumption of isotropy is reasonable.
Fit a variogram model to the empirical variogram
You are now ready to fit a variogram model to the empirical variogram. This amounts to fitting a parametric curve through the set of points that make up the empirical variogram.
Start by plotting the (omni-directional) empirical variogram saved in object p.
p
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The shape of the blue line gives you an idea of the type of variogram family of models you should consider.
Now you can guess at a family for the variogram model and eyeball the parameters. A spherical variogram model has a nearly linear increase in variances with lag distance before an abrupt flattening so that is a good choice.
The parameters for the model can be estimated from the graph as follows.
p +
geom_hline(yintercept = c(12000, 45000), color = "red") +
geom_vline(xintercept = 100, color = "red") +
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 12000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 15, y = 11000, label = "nugget")) +
geom_segment(aes(x = 0, y = 12000, xend = 0, yend = 45000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 10, y = 44000, label = "sill")) +
geom_segment(aes(x = 0, y = 47000, xend = 100, yend = 47000), arrow = arrow(angle = 15, length = unit(.3, "cm"))) +
geom_label(aes(x = 50, y = 48000, label = "range"))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The three parameters used in fitting a variogram model are nugget, sill, and range.
Nugget (nugget, nugget variance, or nugget effect): The height of the variogram at zero lag. The nugget is the variation in the values at the observation locations without regard to spatial variation. Related to the observation (or measurement) precision.
Sill: The height of the variogram at which the values are uncorrelated. The sill is indicated by the height of the plateau in the variogram.
Range: The distance beyond which the values are uncorrelated. The range is indicated by distance along the horizontal axis from zero lag until the plateau in the variogram.
Other terms: (1) Relative nugget effect: The ratio of the nugget to the sill expressed as a percentage. (2) Lag distance: Relative distance between observation locations.
From the figure you estimate the sill at 44000 ft^2, the nugget at 12000 ft^2 and the range at 100 km.
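The vgm() call below therefore needs the partial sill, the eyeballed sill minus the nugget:

```r
sill <- 44000    # eyeballed sill (ft^2)
nugget <- 12000  # eyeballed nugget (ft^2)
sill - nugget    # partial sill
## [1] 32000
```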
To fit a model to the empirical variogram you start with the vgm() function, which sets the curve family (here spherical) and the initial parameter values. You save the result in an object called wca.vmi. The function needs the partial sill (psill = argument), the difference between the sill and the nugget.
wca.vmi <- vgm(model = "Sph",
psill = 32000,
range = 100,
nugget = 12000)
wca.vmi
## model psill range
## 1 Nug 12000 0
## 2 Sph 32000 100
Next you apply the function fit.variogram(), which uses the method of weighted least squares to improve the parameter estimates from the set of initial estimates. The function takes the empirical variogram and the set of initial estimates as object = and model =, respectively.
wca.vm <- fit.variogram(object = wca.v,
model = wca.vmi)
wca.vm
## model psill range
## 1 Nug 9812.335 0.0000
## 2 Sph 34851.456 106.9623
Note: Ordinary least squares is not an appropriate method for fitting a variogram model to the empirical variogram because the semivariances are correlated across the lag distances and the precision on the estimates depends on the number of site pairs for a given lag.
The output table shows the nugget and spherical model. The nugget is 9812 ft^2 and the partial sill for the spherical model is 34851 ft^2 with a range of 107 km. These values are close to your initial estimates.
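From the fitted parameters you can also compute the relative nugget effect defined earlier (the nugget as a percentage of the sill):

```r
nugget <- 9812.335   # fitted nugget (ft^2)
psill <- 34851.456   # fitted partial sill (ft^2)
100 * nugget / (nugget + psill)  # relative nugget effect (%)
## [1] 21.96933
```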
To check the model and fit plot them together with the plot() method.
plot(wca.v, wca.vm)
The blue line is the variogram model and the points are the empirical variogram values.
Note that the fit.variogram() function will find the optimal fit even if the initial values are not very good. Here you lower the partial sill to 10000 ft^2, reduce the range to 50 km and set the nugget to 8000 ft^2.
wca.vmi2 <- vgm(model = "Sph",
psill = 10000,
range = 50,
nugget = 8000)
wca.vm2 <- fit.variogram(object = wca.v,
model = wca.vmi2)
wca.vm2
## model psill range
## 1 Nug 9812.412 0.0000
## 2 Sph 34851.684 106.9645
The initial values are poor but good enough for the fit.variogram() function to find the optimal model.
Compare the spherical model to a Gaussian and an exponential model.
wca.vmi3 <- vgm(model = "Gau",
psill = 30000,
range = 30,
nugget = 10000)
wca.vm3 <- fit.variogram(object = wca.v,
model = wca.vmi3)
wca.vmi4 <- vgm(model = "Exp",
psill = 30000,
range = 10,
nugget = 10000)
wca.vm4 <- fit.variogram(object = wca.v,
model = wca.vmi4)
plot(wca.v, wca.vm3)
plot(wca.v, wca.vm4)
The Gaussian model has an S-shaped (sigmoidal) curve indicating more spatial autocorrelation at close distances. The exponential model has no plateau. Both models fit the empirical variogram values reasonably well.
In practice, the choice often makes very little difference in the quality of the spatial interpolation.
On the other hand, it is possible to optimize over all sets of variogram models and parameters using the autofitVariogram() function from the {automap} package. The package requires the data to be of S4 class but uses the functions from the {gstat} package.
Here you use the function on the Wolfcamp aquifer data.
wca.sp <- as(wca.sf, "Spatial")
wca.vm5 <- automap::autofitVariogram(formula = head ~ X + Y,
input_data = wca.sp)
plot(wca.vm5)
The automatic fitting results in a Matérn model with Stein’s parameterization. The Matérn family of variogram models has an additional parameter kappa (besides the nugget, sill, and range) that controls local smoothing. With an extra parameter these models will generally outperform models with fewer parameters.
Creating an interpolated surface with the method of kriging
Kriging uses the variogram model together with the observed data to estimate values at any location of interest. The kriged estimates are a weighted average of the neighborhood values with the weights defined by the variogram model.
Estimates are often made at locations defined on a regular grid. Here you create a regular grid of locations within the boundary of the spatial data frame using the sf::st_make_grid() function. You specify the number of locations in the x and y direction using the argument n =. The what = "centers" returns the center locations of the grid cells as points.
grid.sfc <- sf::st_make_grid(wca.sf,
n = c(50, 50),
what = "centers")
The result is a simple feature column (sfc) of points.
Plot the grid locations together with the observation locations.
sts <- USAboundaries::us_states()
tmap::tmap_mode("plot")## tmap mode set to plotting
tmap::tm_shape(wca.sf) +
tmap::tm_bubbles(size = .25) +
tmap::tm_shape(grid.sfc) +
tmap::tm_dots(col = "red") +
tmap::tm_shape(sts) +
tmap::tm_borders()
Since there is a trend term you need to add the X and Y values to the simple feature column of the grid. First make it a simple feature data frame, then add the X and Y columns with dplyr::mutate().
XY <- grid.sfc |>
sf::st_coordinates()
grid.sf <- grid.sfc |>
sf::st_as_sf() |>
dplyr::mutate(X = XY[, 1],
Y = XY[, 2])
Next interpolate the aquifer heights to the grid locations. You do this with the krige() function. The first argument is the formula for the trend, the locations = argument is the observed data from the simple feature data frame, the newdata = argument is the locations and independent variables (in this case the trend variables), and the model = argument is the variogram model that you fit above.
wca.int <- krige(head ~ X + Y,
locations = wca.sf,
newdata = grid.sf,
model = wca.vm)## [using universal kriging]
head(wca.int)## Simple feature collection with 6 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -104.5047 ymin: 33.5358 xmax: -104.0517 ymax: 33.5358
## Geodetic CRS: WGS 84
## var1.pred var1.var geometry
## 1 3618.148 36503.33 POINT (-104.5047 33.5358)
## 2 3594.685 36325.39 POINT (-104.4141 33.5358)
## 3 3571.307 36332.17 POINT (-104.3235 33.5358)
## 4 3546.016 35848.59 POINT (-104.2329 33.5358)
## 5 3520.252 34424.00 POINT (-104.1423 33.5358)
## 6 3494.375 31846.14 POINT (-104.0517 33.5358)
The output says using universal kriging. This is because there is a trend and a variogram model.
The output is a simple feature data frame containing the interpolated values at the grid locations in the column labeled var1.pred. The interpolated uncertainty is given in the column labeled var1.var.
You plot the interpolated aquifer heights at the grid locations using functions from the {ggplot2} package.
ggplot() +
geom_sf(data = wca.int,
mapping = aes(col = var1.pred)) +
scale_color_viridis_c() +
labs(col = "Height (ft)") +
theme_minimal()
The trend captures the large scale feature while the variogram captures the local spatial autocorrelation. Together they produce an interpolated surface that matches exactly the values at the observation locations (when the nugget is fixed at zero).
Plot the uncertainty in the estimated interpolated values as the square root of the variance.
ggplot() +
geom_sf(data = wca.int,
mapping = aes(col = sqrt(var1.var))) +
scale_color_viridis_c(option = "plasma") +
labs(col = "Uncertainty (ft)") +
theme_minimal()
Areas with the largest uncertainty are those away from observations (the northwest corner). This makes sense since our knowledge of the aquifer comes from these observations.
Thursday November 17, 2022
“The problem of nonparametric estimation consists in estimation, from the observations, of an unknown function belonging to a sufficiently large class of functions.” - A.B. Tsybakov
Today
- Comparing interpolation methods
- Evaluating the accuracy of the interpolation
Comparing interpolation methods
Here you consider a data set of monthly average surface air temperatures for April across the Midwest. The data are available on my website in the file MidwestTemps.txt.
Start by examining the data for spatial trends.
L <- "http://myweb.fsu.edu/jelsner/temp/data/MidwestTemps.txt"
t.sf <- readr::read_table(L, show_col_types = FALSE) |>
sf::st_as_sf(coords = c("lon", "lat"),
crs = 4326)
XY <- t.sf |>
sf::st_coordinates()
t.sf$X <- XY[, 1]
t.sf$Y <- XY[, 2]
t.sf$geometry |>
duplicated() |>
any()## [1] FALSE
Plot the values on a map.
sts <- USAboundaries::us_states()
tmap::tm_shape(t.sf) +
tmap::tm_text(text = "temp",
size = .6) +
tmap::tm_shape(sts) +
tmap::tm_borders() 
There is a clear trend in the temperature field with the coolest values to the north. Besides the trend there is some local clustering of similar values (spatial autocorrelation).
Compute and plot the empirical variogram on the residuals after removing the trend. The trend term is specified in the formula as temp ~ X + Y.
library(gstat)
t.v <- variogram(temp ~ X + Y,
data = t.sf)
plot(t.v)
Check for anisotropy. Specify four directions and compute the corresponding directional variograms.
t.vd <- variogram(temp ~ X + Y,
data = t.sf,
alpha = c(0, 45, 90, 135))
df <- t.vd |>
as.data.frame() |>
dplyr::mutate(direction = factor(dir.hor))
library(ggplot2)
ggplot(data = df,
mapping = aes(x = dist, y = gamma, color = direction)) +
geom_point() +
geom_smooth(alpha = .2) +
scale_y_continuous(limits = c(0, NA)) +
ylab(expression(paste("Variogram [", gamma,"(h)]"))) +
xlab("Lag distance (h)") +
theme_minimal()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There is no strong evidence to reject the assumption of isotropy.
Use the autofitVariogram() function to get initial estimates.
t.sp <- as(t.sf, "Spatial")
t.vm <- automap::autofitVariogram(formula = temp ~ X + Y,
input_data = t.sp)
plot(t.vm)
Set the initial parameters for a Gaussian model then fit the model.
t.vmi <- vgm(model = "Gau",
psill = 2,
range = 100,
nugget = 1)
t.vmi## model psill range
## 1 Nug 1 0
## 2 Gau 2 100
t.vm <- fit.variogram(object = t.v,
model = t.vmi)## Warning in fit.variogram(object = t.v, model = t.vmi): No convergence after 200
## iterations: try different initial values?
t.vm## model psill range
## 1 Nug 0.7979039 0.00000
## 2 Gau 1.9493490 74.88767
plot(t.v, t.vm)
Make a grid for the interpolated values and add the coordinates as attributes.
grid.sfc <- sf::st_make_grid(t.sf,
n = c(100, 100),
what = "centers")
XY <- grid.sfc |>
sf::st_coordinates()
grid.sf <- grid.sfc |>
sf::st_as_sf() |>
dplyr::mutate(X = XY[, 1],
Y = XY[, 2])
Interpolate with universal kriging.
t.int <- krige(temp ~ X + Y,
locations = t.sf,
newdata = grid.sf,
model = t.vm)## [using universal kriging]
Map the output.
tmap::tm_shape(t.int) +
tmap::tm_dots(title = "°F",
shape = 15,
size = 2,
col = "var1.pred",
n = 9,
palette = "OrRd") +
tmap::tm_shape(sts) +
tmap::tm_borders() +
tmap::tm_shape(t.sf) +
tmap::tm_text("temp",
col = "white",
size = .5) +
tmap::tm_layout(legend.outside = TRUE)
The trend term captures the north-south temperature gradient and the variogram captures the local spatial autocorrelation. Together they produce a better interpolated surface than either component alone.
To see this, you refit the interpolation without the variogram model.
krige(temp ~ X + Y,
locations = t.sf,
newdata = grid.sf) |>
tmap::tm_shape() +
tmap::tm_dots(title = "°F",
shape = 15,
size = 2,
col = "var1.pred",
n = 9,
palette = "OrRd") +
tmap::tm_shape(sts) +
tmap::tm_borders() +
tmap::tm_shape(t.sf) +
tmap::tm_text("temp",
col = "white",
size = .5) +
tmap::tm_layout(legend.outside = TRUE)## [ordinary or weighted least squares prediction]

The result is that the variation in temperatures is interpolated as a simple trend surface.
For another comparison, here you interpolate assuming all variation is spatial autocorrelation (no trend term). This is called ordinary kriging.
krige(temp ~ 1,
locations = t.sf,
newdata = grid.sf,
model = t.vm) |>
tmap::tm_shape() +
tmap::tm_dots(title = "°F",
shape = 15,
size = 2,
col = "var1.pred",
n = 9,
palette = "OrRd") +
tmap::tm_shape(sts) +
tmap::tm_borders() +
tmap::tm_shape(t.sf) +
tmap::tm_text("temp",
col = "white",
size = .5) +
tmap::tm_layout(legend.outside = TRUE)## [using ordinary kriging]

The result is that all variation is local autocorrelation. This produces patches of higher and lower temperatures.
The pattern obtained with ordinary kriging is similar to that obtained using inverse distance weighting. Inverse distance weighting (IDW) is a deterministic method for interpolation. The values assigned to locations are calculated with a weighted average of the values available at the known locations, where the weights are the inverse of the distance to each known location.
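To make the weighted average concrete, here is a small base R illustration with made-up values and a power of 2 (the inverse distance power {gstat} uses by default).

```r
# Values at three known locations
z <- c(60, 64, 68)
# Distances (km) from the prediction location to the known locations
d <- c(10, 20, 40)
# Inverse-distance weights with power p = 2
p <- 2
w <- 1 / d^p
# The IDW estimate is the weighted average of the known values
sum(w * z) / sum(w)
## [1] 61.14286
```

The nearest location carries about 76% of the total weight, so the estimate lies close to its value of 60.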
The krige() function does IDW when there is no trend and no variogram model given.
krige(temp ~ 1,
locations = t.sf,
newdata = grid.sf) |>
tmap::tm_shape() +
tmap::tm_dots(title = "°F",
shape = 15,
size = 2,
col = "var1.pred",
n = 9,
palette = "OrRd") +
tmap::tm_shape(sts) +
tmap::tm_borders() +
tmap::tm_shape(t.sf) +
tmap::tm_text("temp",
col = "white",
size = .5) +
tmap::tm_layout(legend.outside = TRUE)## [inverse distance weighted interpolation]

Simple kriging is ordinary kriging with a specified mean. This is done with the beta = argument.
krige(temp ~ 1,
beta = mean(t.sf$temp),
locations = t.sf,
newdata = grid.sf,
model = t.vm) |>
tmap::tm_shape() +
tmap::tm_dots(title = "°F",
shape = 15,
size = 2,
col = "var1.pred",
n = 9,
palette = "OrRd") +
tmap::tm_shape(sts) +
tmap::tm_borders() +
tmap::tm_shape(t.sf) +
tmap::tm_text("temp",
col = "white",
size = .5) +
tmap::tm_layout(legend.outside = TRUE)## [using simple kriging]

Evaluating the accuracy of the interpolation
How do you evaluate how good the interpolated surface is? If you use the variogram model to predict at the observation locations, you will get the observed values back.
For example, here you interpolate to the observation locations by setting newdata = t.sf instead of grid.sf. You then compute the correlation between the interpolated value and the observed value.
t.int2 <- krige(temp ~ X + Y,
locations = t.sf,
newdata = t.sf,
model = t.vm)## [using universal kriging]
cor(t.int2$var1.pred, t.sf$temp)## [1] 1
So this is not helpful.
Instead you use cross validation. Cross validation is a procedure for assessing how well a model does at predicting (interpolating) values when observations specific to the prediction are removed from the model fitting procedure. Cross validation partitions the data into two disjoint subsets and the model is fit to one subset of the data (training set) and the model is validated on the other subset (testing set).
Leave-one-out cross validation (LOOCV) uses all but one observation for fitting and the left-out observation for testing. The procedure is repeated with every observation taking turns being left out.
krige.cv(temp ~ X + Y,
locations = t.sf,
model = t.vm) |>
sf::st_drop_geometry() |>
dplyr::summarize(r = cor(var1.pred, observed),
rmse = sqrt(mean((var1.pred - observed)^2)),
mae = mean(abs(var1.pred - observed)))## r rmse mae
## 1 0.9452588 1.308603 1.027777
krige.cv(temp ~ 1,
locations = t.sf,
model = t.vm) |>
sf::st_drop_geometry() |>
dplyr::summarize(r = cor(var1.pred, observed),
rmse = sqrt(mean((var1.pred - observed)^2)),
mae = mean(abs(var1.pred - observed)))## r rmse mae
## 1 0.9018904 1.903057 1.403527
krige.cv(temp ~ X + Y,
locations = t.sf) |>
sf::st_drop_geometry() |>
dplyr::summarize(r = cor(var1.pred, observed),
rmse = sqrt(mean((var1.pred - observed)^2)),
mae = mean(abs(var1.pred - observed)))## r rmse mae
## 1 0.9057414 1.698922 1.351733
krige.cv(temp ~ 1,
locations = t.sf) |>
sf::st_drop_geometry() |>
dplyr::summarize(r = cor(var1.pred, observed),
rmse = sqrt(mean((var1.pred - observed)^2)),
mae = mean(abs(var1.pred - observed)))## r rmse mae
## 1 0.9294513 1.785536 1.346272
All four interpolations result in correlations between observed and interpolated values that exceed .9 and root-mean-squared errors (RMSE) less than 2. But the universal kriging interpolation gives the highest correlation and the lowest RMSE and mean absolute error (MAE).
For a visual representation of the goodness of fit you plot the observed versus interpolated values from the cross validation procedure.
krige.cv(temp ~ X + Y,
locations = t.sf,
model = t.vm) |>
dplyr::rename(interpolated = var1.pred) |>
ggplot(mapping = aes(x = observed, y = interpolated)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
geom_smooth(method = lm, color = "red") +
ylab("Interpolated temperatures (°F)") +
xlab("Observed temperatures (°F)") +
theme_minimal()## `geom_smooth()` using formula 'y ~ x'

The black line represents a perfect prediction and the red line is the best fit line when you regress the interpolated temperatures onto the observed temperatures. The fact that the two lines nearly coincide indicates the interpolation is good.
The nfold = argument, which by default is set to the number of observations (giving LOOCV), allows you to divide the data into a smaller number of larger folds instead.
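For example, a five-fold cross validation predicts each fold's observations from the other four fifths of the data. This sketch reuses the t.sf and t.vm objects from above; because the folds are assigned at random, a seed is set for reproducibility.

```r
set.seed(4321)
krige.cv(temp ~ X + Y,
         locations = t.sf,
         model = t.vm,
         nfold = 5) |>
  sf::st_drop_geometry() |>
  dplyr::summarize(r = cor(var1.pred, observed),
                   rmse = sqrt(mean((var1.pred - observed)^2)),
                   mae = mean(abs(var1.pred - observed)))
```

The skill metrics will be close to, but typically slightly worse than, the LOOCV values since each fold withholds more data.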
Note that these performance metrics are biased toward the sample of data because cross validation is done only on the interpolation (kriging) and not on the variogram model fitting.
That is, with kriging the data is used in two ways (1) to fit the variogram model, and (2) to interpolate the values.
To perform a full LOOCV you need to refit the variogram after removing the observation for which you want the interpolation.
vmi <- vgm(model = "Sph",
psill = 2,
range = 200,
nugget = 1)
int <- NULL
for(i in 1:nrow(t.sf)){
t <- t.sf[-i, ]
v <- variogram(temp ~ X + Y,
data = t)
vm <- fit.variogram(object = v,
model = vmi)
int[i] <- krige(temp ~ X + Y,
locations = t,
newdata = t[i, ],
model = vm)$var1.pred
}
The interpolation error estimated with this full cross validation will generally be larger than the error estimated using a single fixed variogram model.
Block cross validation
One final note about cross validation in the context of spatial data: the observations are not independent. As such it is better to make the spatial areas used for training separate from the spatial areas used for testing.
A nice introduction to so-called ‘block’ cross validation in the context of species distribution modeling is available here https://cran.r-project.org/web/packages/blockCV/vignettes/BlockCV_for_SDM.html
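A minimal sketch of the idea with the temperature data from above, using only {sf} tools: overlay a coarse grid and use the grid-cell index as the fold membership. The 4-by-4 grid size is an arbitrary choice for illustration, and the sketch assumes (per the {gstat} documentation) that nfold = also accepts a vector of fold assignments, one per observation.

```r
# Overlay a coarse grid on the observation locations
blocks <- sf::st_make_grid(t.sf, n = c(4, 4))
# Fold membership is the index of the grid cell containing each point
# (assumes each point falls inside exactly one cell)
fold <- unlist(sf::st_intersects(t.sf, blocks))
# Observations are now withheld a block at a time rather than one at a time
krige.cv(temp ~ X + Y,
         locations = t.sf,
         model = t.vm,
         nfold = fold) |>
  sf::st_drop_geometry() |>
  dplyr::summarize(rmse = sqrt(mean((var1.pred - observed)^2)))
```

Because nearby observations no longer help predict each other, the block RMSE is a more honest estimate of how the model would perform in unsampled areas.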
Thursday November 29, 2022
“Practice any art, music, singing, dancing, acting, drawing, painting, sculpting, poetry, fiction, essays, reportage, no matter how well or badly, not to get money & fame, but to experience becoming, to find out what’s inside you, to make your soul grow.” - Kurt Vonnegut
Today
- Interpolating to areal units
- Simulating spatial fields
- Interpolating multiple variables
- Machine learning for spatial interpolation
Interpolating to areal units (block kriging)
In 2008 tropical cyclone (TC) Fay formed from a tropical wave near the Dominican Republic, passed over the island of Hispaniola, Cuba, and the Florida Keys, then crossed the Florida peninsula and moved westward across portions of the Panhandle producing heavy rains in parts of the state.
Rainfall is an example of geostatistical data. In principle it can be measured anywhere, but typically you have values at a sample of sites. The pattern of observation sites is not of much interest as it is a consequence of constraints (convenience, opportunity, economics, etc) unrelated to the phenomenon. Interest centers on inference about how much rain fell across the region.
Storm total rainfall amounts from stations in and around the state are in FayRain.txt on my website. They are compiled reports from official weather sites and many cooperative sites. The cooperative sites are the Community Collaborative Rain, Hail and Snow Network (CoCoRaHS), a community-based, high density precipitation network made up of volunteers who take measurements of precipitation in their yards. The data were obtained from NOAA/NCEP/HPC and from the Florida Climate Center.
Import the data.
L <- "http://myweb.fsu.edu/jelsner/temp/data/FayRain.txt"
( FR.df <- readr::read_table(L) )##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## lon = col_double(),
## lat = col_double(),
## tpi = col_double(),
## tpm = col_double()
## )
## # A tibble: 803 × 4
## lon lat tpi tpm
## <dbl> <dbl> <dbl> <dbl>
## 1 -82.4 29.7 7.4 188.
## 2 -82.3 29.8 9.99 254.
## 3 -82.4 29.6 8.01 203.
## 4 -82.4 29.6 5.71 145.
## 5 -82.1 30.2 10.8 273.
## 6 -82.3 30.3 14.3 364.
## 7 -80.6 28.0 14.0 356.
## 8 -80.5 28.0 14.5 369.
## 9 -80.6 28.4 0 0
## 10 -80.7 28.3 13.6 346.
## # … with 793 more rows
The data frame contains 803 rainfall sites. Longitude and latitude coordinates of the sites are given in the first two columns, and total rainfall in inches (tpi) and millimeters (tpm) is given in the last two columns.
Create a simple feature data frame by specifying the columns that contain the spatial coordinates. Then assign a geographic coordinate system and convert the rainfall from millimeters to centimeters.
FR.sf <- sf::st_as_sf(x = FR.df,
coords = c("lon", "lat"),
crs = 4326) |>
dplyr::mutate(tpm = tpm/10)
summary(FR.sf$tpm)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 8.306 15.799 17.631 24.372 60.223
The median rainfall across all available observations is 15.8 cm and the highest is 60.2 cm.
Get the Florida county boundaries from the {USAboundaries} package.
FL.sf <- USAboundaries::us_counties(states = "Florida")
Transform the geographic coordinates of the site locations and map polygons to projected coordinates. Here you use Florida GDL Albers (EPSG:3087) with meter as the length unit.
FR.sf <- sf::st_transform(FR.sf, crs = 3087)
FL.sf <- sf::st_transform(FL.sf, crs = 3087)
sf::st_crs(FR.sf)## Coordinate Reference System:
## User input: EPSG:3087
## wkt:
## PROJCRS["NAD83(HARN) / Florida GDL Albers",
## BASEGEOGCRS["NAD83(HARN)",
## DATUM["NAD83 (High Accuracy Reference Network)",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4152]],
## CONVERSION["Florida GDL Albers (meters)",
## METHOD["Albers Equal Area",
## ID["EPSG",9822]],
## PARAMETER["Latitude of false origin",24,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-84,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",24,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",31.5,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",400000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["State-wide spatial data management."],
## AREA["United States (USA) - Florida."],
## BBOX[24.41,-87.63,31.01,-79.97]],
## ID["EPSG",3087]]
Start by making a map of the rainfall sites and storm total rainfall that includes the state border.
tmap::tm_shape(FR.sf) +
tmap::tm_dots(col = "tpm", size = .5) +
tmap::tm_shape(FL.sf) +
tmap::tm_borders()
Two areas of very heavy rainfall are noted. One running north-south along the east coast and another across the north.
Rainfall reporting sites are clustered in and around cities and are located only over land. This type of station location arrangement will make it hard for deterministic interpolation methods (e.g., IDW or splines) to produce a reasonable surface.
The empirical variogram is computed using the variogram() function from the {gstat} package. The first argument is the model formula specifying the rainfall column from the data frame and the second argument is the data frame name. Here ~ 1 in the model formula indicates no covariates or trends in the data. Trends are included by specifying coordinate names through the st_coordinates() function.
Compute the empirical variogram for this set of rainfall values. Use a cutoff distance of 400 km (400,000 m). The cutoff is the separation distance up to which point pairs are included in the semivariogram. The smaller the cutoff value the more the variogram is focused on nearest neighbor locations.
library(gstat)
v <- variogram(tpm ~ 1,
data = FR.sf,
cutoff = 400000)
Plot the variogram values as a function of lag distance and add text indicating the number of point pairs for each lag distance. Save a copy of the plot for later.
library(ggplot2)
v.df <- data.frame(dist = v$dist/1000,
gamma = v$gamma,
np = v$np)
( pv <- ggplot(v.df, aes(x = dist, y = gamma)) +
geom_point() +
geom_text(aes(label = np), nudge_y = -5) +
scale_y_continuous(limits = c(0, 220)) +
scale_x_continuous(limits = c(0, 400)) +
xlab("Lagged distance (h) [km]") +
ylab(expression(paste("Semivariance (", gamma, ") [", cm^2, "]"))) +
theme_minimal() )
Values start low (around 50 cm\(^2\)) at the shortest lag distance and increase to greater than 200 cm\(^2\) at lag distances of 200 km and longer.
The semivariance at lag zero is called the ‘nugget’ and the semivariance at the level where the variogram values no longer increase is called the ‘sill.’ The difference between the sill and the nugget is called the ‘partial sill.’ The lag distance out to where the sill is reached is called the ‘range.’ These three parameters (nugget, partial sill, and range) are used to model the variogram.
Next fit a model to the empirical variogram. The model is a mathematical relationship that defines the semivariance as a function of lag distance. First save the family and the initial parameter guesses in a variogram model (vmi) object.
vmi <- vgm(model = "Gau",
psill = 150,
range = 200 * 1000,
nugget = 50)
vmi## model psill range
## 1 Nug 50 0e+00
## 2 Gau 150 2e+05
The psill argument is the partial sill (the difference between the sill and the nugget) along the vertical axis. Estimate the parameter values by looking at the empirical variogram.
Next use the fit.variogram() function to improve the fit over these initial values. Given a set of initial parameter values the method of weighted least squares is used to improve the parameter estimates.
vm <- fit.variogram(object = v,
model = vmi)
vm## model psill range
## 1 Nug 46.58044 0.0
## 2 Gau 156.24464 127724.3
The result is a variogram model with a nugget of 46.6 cm\(^2\), a partial sill of 156 cm\(^2\), and a range of 128 km.
Plot the model on top of the empirical variogram. Let \(r\) be the range, \(c\) the partial sill and \(c_o\) the nugget, then the equation defining the function over the set of lag distances \(h\) is
\[ \gamma(h)=c\left(1-\exp\left(-\frac{h^2}{r^2}\right)\right)+c_o \]
Create a data frame with values of h and gamma using this equation.
nug <- vm$psill[1]
ps <- vm$psill[2]
r <- vm$range[2] / 1000
h <- seq(0, 400, .2)
gamma <- ps * (1 - exp(-h^2 / (r^2))) + nug
vm.df <- data.frame(dist = h,
gamma = gamma)
pv + geom_line(aes(x = dist, y = gamma), data = vm.df)
Check for anisotropy. Anisotropy refers to a dependence of the variogram shape on the direction of the location pairs used to compute semivariances. Isotropy refers to a directional independence.
plot(variogram(tpm ~ 1,
data = FR.sf,
alpha = c(0, 45, 90, 135),
cutoff = 400000),
xlab = "Lag Distance (m)")
The semivariance values reach the sill at a longer range (about 300 km) in the north-south direction (0 degrees) compared to the other three directions.
Another way to look at directional dependence in the variogram is through a variogram map. Instead of classifying point pairs Z(s) and Z(s + h) by direction and distance class separately, you classify them jointly.
If h = {x, y} is the two-dimensional coordinates of the separation vector, in the variogram map the variance contribution of each point pair (Z(s) − Z(s + h))^2 is attributed to the grid cell in which h lies. The map is centered at (0, 0) and h is lag distance. Cutoff and width correspond to map extent and cell size; the semivariance map is point symmetric around (0, 0), as γ(h) = γ(−h).
The variogram map is made with the variogram() function by adding the map = TRUE argument. Here you set the cutoff to be 200 km (200,000 m) and the width (cell size) to be 20 km.
vmap <- variogram(tpm ~ 1,
data = FR.sf,
cutoff = 200000,
width = 20000,
map = TRUE)
plot(vmap)
The variogram map is centered on dx = 0 and dy = 0. Along the dx = 0 vertical line in the north-south direction (top-to-bottom on the plot) the semivariance values increase away from dy = 0, but the increase is less compared to along the dy = 0 horizontal line in the east-west direction (left-to-right on the plot) indicative of directional dependency.
You refit the variogram model defining an anisotropy ellipse with the anis = argument. The first parameter is the direction of the longest range (here north-south) and the second parameter is the ratio of the shortest to longest range, here about 200/300 = .67.
vmi <- vgm(model = "Gau",
psill = 150,
range = 300 * 1000,
nugget = 50,
anis = c(0, .67))
vm <- fit.variogram(v, vmi)
Use the variogram model together with the rainfall values at the observation sites to create an interpolated surface. Here you use ordinary kriging as there are no spatial trends in the rainfall.
Interpolation is done using the krige() function. The first argument is the model specification and the second is the data. Two other arguments are needed. One is the variogram model using the argument name model = and the other is a set of locations identifying where the interpolations are to be made. This is specified with the argument name newdata =.
Here you interpolate to locations on a regular grid. You create a grid of locations within the borders of the state using the st_sample() function.
grid.sf <- sf::st_sample(FL.sf,
size = 5000,
type = "regular")
You specify the number of grid locations using the argument size =. Note that the actual number of locations will be somewhat different because of the irregular boundary.
First use the krige() function to interpolate the observed rainfall to the grid locations. For a given location, the interpolation is a weighted average of the rainfall across the entire region where the weights are determined by the variogram model.
r.int <- krige(tpm ~ 1,
locations = FR.sf,
newdata = grid.sf,
model = vm)## [using ordinary kriging]
If the variogram model is not included then inverse distance-weighted interpolation is performed. The function will not work if different values share the same location.
The saved object (r.int) inherits the spatial geometry specified in the newdata = argument but extends it to a spatial data frame. The column var1.pred in the data frame is the interpolated rainfall and the column var1.var is the prediction variance about the interpolated value.
Plot the interpolated storm-total rainfall field.
tmap::tm_shape(r.int) +
tmap::tm_dots("var1.pred",
size = .1,
palette = "Greens",
title = "Rainfall (cm)") +
tmap::tm_shape(FL.sf) +
tmap::tm_borders() +
tmap::tm_layout(legend.position = c("left", "bottom"),
title = "TC Fay (2008)",
title.position = c("left", "bottom"),
legend.outside = TRUE)
Note: a portion of the data locations are outside of the state, but interest is only in interpolated values within the state border as specified by the newdata = argument.
The spatial interpolation shows that parts of east central and north Florida were deluged by Fay with rainfall totals exceeding 30 cm (12 in).
Block kriging
The interpolation can also be done as an area average. For example what was the storm-total average rainfall for each county?
County level rainfall is relevant for water resource managers. Block kriging produces an estimate of this area average, which will differ from a simple average over all sites within the county because of the spatial autocorrelation in rainfall observations.
You use the same function to interpolate but specify the spatial polygons rather than the spatial grid as the new data. Here the spatial polygons are the county borders.
r.int2 <- krige(tpm ~ 1,
locations = FR.sf,
newdata = FL.sf,
model = vm)## [using ordinary kriging]
Again plot the interpolations.
tmap::tm_shape(r.int2) +
tmap::tm_polygons(col = "var1.pred",
palette = "Greens",
title = "Rainfall (cm)") +
tmap::tm_layout(legend.position = c("left", "bottom"),
title = "TC Fay (2008)",
title.position = c("left", "bottom"))
The overall pattern of rainfall from Fay, featuring the largest amounts along the central east coast and over the Big Bend region, is similar in both maps, but the block estimates answer questions like: on average, how much rain fell over Leon County during TC Fay?
You compare the kriged average with the simple average at the county level with the aggregate() method. The argument FUN = mean says to compute the average of the values in FR.sf across the polygons in FL.sf.
r.int3 <- aggregate(FR.sf,
by = FL.sf,
FUN = mean)
The result is a simple feature data frame of the average rainfall in each county.
The state-wide mean of the kriged estimates at the county level is
round(mean(r.int2$var1.pred), 2)## [1] 21.1
This compares with a state-wide mean from the simple averages.
round(mean(r.int3$tpm), 2)## [1] 20.85
The correlation between the two estimates across the 67 counties is
round(cor(r.int3$tpm, r.int2$var1.pred), 2)## [1] 0.87
The variogram model reduces the standard deviation of the kriged estimate relative to the standard deviation of the simple averages because of the local smoothing.
round(sd(r.int2$var1.pred), 2)## [1] 8.06
round(sd(r.int3$tpm), 2)## [1] 9.83
This can be seen with a scatter plot of simple averages versus kriged averages at the county level.
compare.df <- data.frame(simpleAvg = r.int3$tpm,
krigeAvg = r.int2$var1.pred)
ggplot(compare.df, aes(x = simpleAvg,
y = krigeAvg)) +
geom_point() +
geom_abline(slope = 1) +
geom_smooth(method = lm, se = FALSE)## `geom_smooth()` using formula 'y ~ x'

An advantage of kriging as a method of spatial interpolation is the accompanying uncertainty estimates. The prediction variances are listed in a column of the spatial data frame saved from applying the krige() function. Variances are smaller in regions with more rainfall observations.
Prediction variances are also smaller with block kriging as much of the variability within the county averages out. To compare the distribution characteristics of the prediction variances for the point and block kriging of the rainfall observations, type
round(summary(r.int$var1.var), 1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.4 49.1 50.2 51.1 52.3 102.5
round(summary(r.int2$var1.var), 1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6 1.9 2.7 3.1 3.8 9.2
The median prediction variance (in cm\(^2\)) for the point kriging is close to the value of the nugget.
round(fivenum(r.int$var1.var)[3], 1)## [1] 50.2
In contrast, the median prediction variance for the block kriging is much smaller.
round(fivenum(r.int2$var1.var)[3], 1)## [1] 2.7
Simulating spatial fields
Simulations use this uncertainty to provide additional data for deterministic models. Suppose for example you have a hydrology model of rainfall runoff. Given a spatial field of rain amounts the model predicts a discharge rate at some location along a river. The uncertainty in the predicted runoff rate at the location is due to the uncertainty in where and how hard the rain fell (in the rainfall field) and not due to the deterministic hydrology model.
The uncertainty in the rainfall field is simulated conditional on the observations with the same krige() function by adding the argument nsim = that specifies the number of simulations.
For a large number of simulations it may be necessary to limit the number of neighbors used in the kriging. This is done using the nmax = argument. For a given location, the weights assigned to observations far away are very small, so it is efficient to limit how many observations are used in the simulation.
As an example, here you generate four realizations of the county-level average storm total rainfall for Fay and limit the neighborhood to 50 of the closest observation sites. This takes a few seconds.
r.sim <- krige(tpm ~ 1,
               locations = FR.sf,
               newdata = FL.sf,
               model = vm,
               nsim = 4,
               nmax = 50)
## drawing 4 GLS realisations of beta...
## [using conditional Gaussian simulation]
Given the variogram model, the simulations are conditional on the observed rainfall.
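What "conditional" means can be seen in a minimal one-dimensional sketch, assuming a known exponential covariance and a zero-mean field. This is an illustration of the idea, not gstat's internal algorithm: the simulated curve honors the observations exactly and varies realistically between them.

```r
# One-dimensional conditional Gaussian simulation from first principles.
# The covariance model and data are illustrative assumptions.
set.seed(1)
cov.fun <- function(h) exp(-h / 10)  # exponential covariance, range 10, sill 1

x.obs <- c(2, 5, 9)          # observation locations
z.obs <- c(1.0, 0.2, -0.5)   # observed values (zero-mean field assumed)
x.new <- seq(0, 10, by = 0.5)

C.oo <- cov.fun(abs(outer(x.obs, x.obs, "-")))
C.no <- cov.fun(abs(outer(x.new, x.obs, "-")))
C.nn <- cov.fun(abs(outer(x.new, x.new, "-")))

W <- C.no %*% solve(C.oo)          # simple kriging weights
mu.c <- as.vector(W %*% z.obs)     # conditional mean (the kriging surface)
Sigma.c <- C.nn - W %*% t(C.no)    # conditional covariance

# One realization: conditional mean plus correlated noise
L <- t(chol(Sigma.c + diag(1e-8, nrow(Sigma.c))))
z.sim <- mu.c + as.vector(L %*% rnorm(length(x.new)))
# z.sim reproduces z.obs at the observation locations (up to tiny jitter)
```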
tmap::tm_shape(r.sim) +
  tmap::tm_polygons(col = c("sim1", "sim2", "sim3", "sim4"),
                    palette = "Greens",
                    title = "Simulated Rainfall [cm]") +
  tmap::tm_facets(free.scales = FALSE)
The overall pattern of rainfall remains the same, but there are differences especially in counties with fewer observations and in counties where the rainfall gradients are sharp.
Interpolating multiple variables
Spatial interpolation can be extended to obtain surfaces of multiple variables. The idea is that if two field variables are correlated, then information about the spatial correlation in one variable provides information about values of the other: the spatial variability of one variable is correlated with the spatial variability of the other. And the idea is not limited to two variables.
Here you consider observations of heavy metal concentrations (ppm) in the top soil of the flood plain of the river Meuse near the village of Stein. The data are available in the {sp} package.
library(sp)
data(meuse)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
The metals include cadmium, copper, lead, and zinc. Observation locations are given by x and y. Other variables include elevation, soil type and distance to the river.
Create a simple feature data frame with a projected coordinate system for the Netherlands.
meuse.sf <- sf::st_as_sf(x = meuse,
                         coords = c("x", "y"),
                         crs = 28992)
Interest is in the spatial distribution of all four heavy metals in the soil.
Map the concentrations at the observation locations.
tmap::tmap_mode("view")
## tmap mode set to interactive viewing
tmap::tm_shape(meuse.sf) +
  tmap::tm_dots(col = c("cadmium", "copper", "lead", "zinc"))
All observations (bulk sampled from areas of approximately 15 m by 15 m) have units of ppm. The most abundant heavy metal is zinc, followed by lead and copper. For all four metals the highest concentrations are found nearest the river, so you include distance to the river as a covariate (trend term) and use universal kriging.
The distribution of concentrations is skewed, with many locations having only low levels of heavy metals and a few having very high levels.
ggplot(data = meuse.sf,
       mapping = aes(x = lead)) +
  geom_histogram(bins = 17) +
  theme_minimal()
Thus you use a logarithmic transformation.
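The effect of the transform is easy to verify on simulated skewed data (illustrative values, not the meuse concentrations): the log removes most of the right skew.

```r
# Simulated right-skewed data, similar in shape to trace-metal concentrations
set.seed(1)
z <- rlnorm(500, meanlog = 3, sdlog = 1)

skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(z)       # strongly right-skewed
skewness(log(z))  # near zero after the transform
```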
First you organize the data as a gstat object. This is done with the gstat() function, which collects (and copies) the variables into a single object. The variables are added successively, one call at a time.
Here you specify the trend using the square root of the distance to river and take the natural logarithm of the heavy metal concentration. You give the dependent variable a new name with the id = argument.
g <- gstat(id = "logCd",
           formula = log(cadmium) ~ sqrt(dist),
           data = meuse.sf)
g <- gstat(g,
           id = "logCu",
           formula = log(copper) ~ sqrt(dist),
           data = meuse.sf)
g <- gstat(g,
           id = "logPb",
           formula = log(lead) ~ sqrt(dist),
           data = meuse.sf)
g <- gstat(g,
           id = "logZn",
           formula = log(zinc) ~ sqrt(dist),
           data = meuse.sf)
g
## data:
## logCd : formula = log(cadmium)`~`sqrt(dist) ; data dim = 155 x 12
## logCu : formula = log(copper)`~`sqrt(dist) ; data dim = 155 x 12
## logPb : formula = log(lead)`~`sqrt(dist) ; data dim = 155 x 12
## logZn : formula = log(zinc)`~`sqrt(dist) ; data dim = 155 x 12
Next you use the variogram() function to compute empirical variograms. The function, when operating on a gstat object, computes all direct and cross variograms.
v <- variogram(g)
plot(v)
The plot method displays the set of direct and cross variograms. The direct variograms are shown in the four panels along the diagonal of the triangle of plots.
The cross variograms are shown in the six panels below the diagonal. For example, the cross variogram between the values of cadmium and copper is given in the second row of the first column and so on.
The cross variogram is analogous to the multi-type \(K\) function for analyzing point pattern data.
The cross variograms show small semivariance values at short lag distance with increasing semivariance values at longer lags. Because these variables are co-located, you can also compute direct correlations.
cor(meuse[c("cadmium", "copper", "lead", "zinc")])
##           cadmium    copper      lead      zinc
## cadmium 1.0000000 0.9254499 0.7989466 0.9162139
## copper  0.9254499 1.0000000 0.8183069 0.9082695
## lead    0.7989466 0.8183069 1.0000000 0.9546913
## zinc    0.9162139 0.9082695 0.9546913 1.0000000
The direct correlation between cadmium and copper is .92 and between cadmium and lead is .8.
The correlation matrix confirms strong cross correlation among the four variables at zero lag. The cross variogram generalizes these correlations across lag distance. For instance, the cross variogram indicates the strength of the relationship between cadmium at one location and copper at nearby locations.
You use the fit.lmc() function to fit separate variogram models to each of the empirical variograms. You use an initial partial sill of .5, an initial nugget of zero and an initial range of 800 meters.
vm <- fit.lmc(v, g,
              vgm(model = "Sph",
                  psill = .5,
                  nugget = 0,
                  range = 800))
plot(v, vm)
The final variogram models (blue lines) fit the empirical variograms (direct and cross) well.
Given the variogram models, co-kriged maps are produced using the predict() method after setting the grid locations for the interpolations. The CRS for the grid locations must match the CRS of the data.
data(meuse.grid)
grid.sf <- sf::st_as_sf(x = meuse.grid,
                        coords = c("x", "y"),
                        crs = 28992)
hm.int <- predict(vm, grid.sf)
## Linear Model of Coregionalization found. Good.
## [using universal cokriging]
names(hm.int)
## [1] "logCd.pred" "logCd.var" "logCu.pred" "logCu.var"
## [5] "logPb.pred" "logPb.var" "logZn.pred" "logZn.var"
## [9] "cov.logCd.logCu" "cov.logCd.logPb" "cov.logCu.logPb" "cov.logCd.logZn"
## [13] "cov.logCu.logZn" "cov.logPb.logZn" "geometry"
Plot the interpolated logarithms of the concentrations.
tmap::tmap_mode("plot")
## tmap mode set to plotting
tmap::tm_shape(hm.int) +
  tmap::tm_dots(col = c("logCd.pred", "logCu.pred", "logPb.pred", "logZn.pred"),
                size = .2, breaks = seq(-2, 8, by = 1), palette = "Reds", midpoint = NA)
The patterns of heavy metal concentrations are similar, with the highest values along the river bank.
Compare with predictions made using only the cadmium observations.
v2 <- variogram(log(cadmium) ~ sqrt(dist),
                data = meuse.sf)
vm2 <- fit.variogram(v2, vgm(psill = .15, model = "Sph",
                             range = 800, nugget = .1))
int <- krige(log(cadmium) ~ sqrt(dist), meuse.sf, newdata = grid.sf,
             model = vm2)
## [using universal kriging]
p1 <- tmap::tm_shape(int) +
  tmap::tm_dots(col = "var1.pred",
                size = .2, palette = "Reds", breaks = seq(-2, 3, by = .5))
p2 <- tmap::tm_shape(hm.int) +
  tmap::tm_dots(col = "logCd.pred",
                size = .2, palette = "Reds", breaks = seq(-2, 3, by = .5))
tmap::tmap_arrange(p1, p2)
## Warning: Breaks contains positive and negative values. Better is to use
## diverging scale instead, or set auto.palette.mapping to FALSE.

cor(hm.int$logCu.pred, int$var1.pred)
## [1] 0.913258
Only minor differences are visible on the plot and the correlation between the two interpolations exceeds .9.
Plot the covariances between zinc and cadmium.
tmap::tm_shape(hm.int) +
  tmap::tm_dots(col = "cov.logCd.logZn", size = .2)
The map shows areas of the flood plain with high (and low) correlations between cadmium and zinc. Caution: higher values of the covariance indicate lower correlations; there is an inverse relationship between the variogram and the covariogram.
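That inverse relationship is the identity \(\gamma(h) = C(0) - C(h)\) for a stationary field: where the covariogram is high, the semivariance is low, and vice versa. A quick numerical check on a simulated stationary series (an AR(1) series standing in for a spatial transect):

```r
# Check gamma(h) = C(0) - C(h) on a long stationary AR(1) series
set.seed(1)
z <- as.vector(arima.sim(model = list(ar = 0.7), n = 50000))

h <- 5
n <- length(z) - h
gamma.h <- mean((z[1:n] - z[(1 + h):(n + h)])^2) / 2  # semivariance at lag h
C.0 <- mean((z - mean(z))^2)                          # covariogram at lag 0
C.h <- mean((z[1:n] - mean(z)) * (z[(1 + h):(n + h)] - mean(z)))

all.equal(gamma.h, C.0 - C.h, tolerance = 0.01)
```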
Obtaining a quality statistical spatial interpolation is a nuanced process, but with practice kriging can be an important tool in your toolbox.
Kriging is a useful tool for ‘filling in the gaps’ between sampling sites. It is handy when you want to make a map, or when you need to match up two spatial data sets that overlap in extent but have samples at different locations.